FOREWORD


This  report  was  prepared  by  the  Engineering  Psychology  Branch 
of  the  Behavioral  Sciences  Laboratory,  Aerospace  Medical  Division, 
Wright  Air  Development  Division.  The  work  began  as  the  responsibility 
of  the  Controls  Section,  under  Research  and  Development  Task  Number 
7182-71514 with James V. Bradley acting as Task Scientist. It continued
under subsequent Research and Development Task 7184-71581
and was finished by the author as a member of the Maintenance Design
Section. The manuscript was typed at the Aviation Psychology Project,
Miami University, under Contract Number AF 33(616)-5624, under the
technical  supervision  of  Dr.  Clarke  W.  Crannell  and  Dr.  S.  A.  Switzer. 

The  material  included  is  the  result  of  a  review  of  the  literature 
begun  early  in  1955  with  the  approval  of  Mr.  John  W.  Senders,  then 
Section  Chief  of  the  Controls  Section,  and  ending  early  in  1958.  The 
author  was  greatly  aided  in  this  effort  by  I.  R.  Savage's  "Bibliography 
of Nonparametric Statistics and Related Topics", by hundreds of
statisticians  and  institutions  sending  reprints  and  by  the  encouragement 
of  his  colleagues.  He  is  particularly  indebted  to  Dr.  Philburn  Ratoosh 
who  critically  reviewed  the  next-to-final  draft,  to  Dr.  Virginia  L. 
Senders  and  Dr.  Harry  J.  Jerison  whose  constant  interest  helped  the 
author  to  maintain  momentum,  and  to  Mr.  John  W.  Senders,  Dr.  H.  R. 
van Saun, Dr. John P. Hornseth and Major Leroy Pigg who, as Section
Chiefs,  exercised  their  administrative  powers  in  support  of  the 
undertaking. 


ABSTRACT 


As  a  result  of  an  extensive  survey  of  the  literature,  a  large  number 
of  distribution-free  statistical  tests  are  examined.  Tests  are  grouped 
together  primarily  according  to  general  type  of  mathematical  derivation 
or type of statistical "information" used in conducting the test. Each
of  the  more  important  tests  is  treated  under  the  headings:  Rationale, 
Null Hypothesis, Assumptions, Treatment of Ties, Efficiency, Application,
Discussion, Tables, and Sources. Derivations are given
and  mathematical  interrelationships  among  the  tests  are  indicated. 
Strengths  and  weaknesses  of  individual  tests,  and  of  distribution-free 
tests  as  a  class  compared  to  parametric  tests,  are  discussed. 
CHAPTER I


INTRODUCTION 


Figure 1. Radically nonnormal distribution obtained in a routine
experiment by the author. (Histogram is based on 2520 scores;
smooth curve is normal distribution with same mean, variance and
area as histogram.)

1.  History

Although nonparametric statistics can be traced as far back
as 1710, when John Arbuthnott attempted to prove the wisdom of Divine
Providence using the statistical Sign test, the preponderance of such
tests are of quite recent origin. Van Dantzig and Hemelrijk (7) distinguish
four stages of statistical development. In the first or one-parameter
stage statistical quantities were considered to be constants
such as the ratio of the yearly number of deaths to number of living.
In the second or two-parameter stage variability was recognized as a
factor and it was believed that empirical distributions could be described
by stating the mean and variance, the parent distribution being assumed
to be a normal distribution. In the third or multiparameter stage, universal
normality was no longer an article of faith, but it was believed
that an empirical distribution could be described by identifying its moments
on the assumption that "statistical phenomena were governed by
laws of general validity albeit that they showed somewhat greater complexity
than just the normal law." The various Types of Pearsonian
Curve were a product of this phase. In the fourth or no-parameter phase
efforts to identify parameters of a parent population in order to be able
to specify its probability law were largely replaced by attempts to determine
"exact relations, valid for restricted sample sizes." Savage (38)
places the "true beginning" of nonparametric statistics in 1936, and it is
indeed at about this time that it began to take the form of a separate
statistical discipline. The rapid growth of activity in this field since
that date can be inferred from Figure 2 which shows the proportion of


Figure 2. Twenty year growth of activity in the area of nonparametric
statistics (as broadly defined by Savage). (Vertical axis: proportion of
contents of each year of the Annals of Mathematical Statistics;
horizontal axis: year of publication.)
articles in each volume of the Annals of Mathematical Statistics which
are listed in Savage's "Bibliography of Nonparametric Statistics and
Related Topics".


2.  Definitions 


The terms "nonparametric" and "distribution-free" are neither
semantically satisfactory nor synonymous. This matter has been discussed
at length by Kendall and Sundrum (28) who have attempted definitions
of the terms which reflect the theoretical limitations of the tests
to which they are commonly applied. Popular usage, however, has
equated the terms and they will be used interchangeably throughout
this report. Grossly speaking, a nonparametric test is one which
makes no hypothesis about the value of a parameter in a statistical
density function, while a distribution-free test is one which makes no
assumptions about the precise form of the sampled population. Frequently
the assumption is made that it is continuously distributed and
sometimes more elaborate assumptions are made such as the assumption
that the sampled populations have identical shapes or distributions
symmetrical about the same point. However, the assumptions are
never so elaborate as to imply a population whose distribution is completely
specified. The term distribution-free is somewhat deceptive,
however. The reason that no elaborate assumptions are made about the
distribution of population magnitudes is very simple: the magnitudes
are not used as such in the test. Instead, the ranks, ordinal position,
frequency or some such attribute of the original observations provide
the "information" used by the test statistic. And of course the "population"
distribution of the attribute used must be known exactly for the
conditions stated in the null hypothesis, just as must the population distribution
of magnitudes in classical statistical tests. An important distinction
should be made, however. While both parametric and nonparametric
tests require that the form of a distribution be fully known, that
knowledge, in the parametric case, is generally not forthcoming and the
required distribution of magnitudes must therefore be "assumed" or
inferred on the basis of approximate or incomplete information. In
the nonparametric case, on the other hand, the distribution of the
attribute is usually known precisely from a priori considerations and
need not, therefore, be "assumed." The difference, then, is not one
of requirement but rather of what is required and of certainty that the
requirement will be met.

Because they do not use magnitudes as such, distribution-free
tests do not test for parameters computed from them in the same sense
that classical tests test for equal means, say, or identical variances.
Instead, the analogous distribution-free tests might test for equal medians
or identical interquartile ranges, i.e., values which can be computed
from nonmagnitudinal attributes such as frequency, or position in rank
order. Of course, a distribution-free test may be indirectly a test
for parameters based on magnitudes; for example, if symmetrical populations
can be assumed, then a distribution-free test for equal medians
becomes, in addition, a test for equal means.

Although distribution-free tests generally are not based directly
upon the magnitudes of the original observations, results by Stuart (46,
47) suggest that inferences from some such tests may be extended to
the original magnitudes with a high degree of approximation. Stuart
found very high correlations between observations, from either the
normal or the uniform distribution, and their ranks. The correlations
were respectively .94 and .96 for samples of 25 observations, and increased
with increasing sample size toward limits of .98 and 1.00.
The existence of these correlations is dependent merely upon the existence
of a variance.
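
The quoted correlations are consistent with a simple closed form, which
is supplied here as an assumption and is not stated in this report: the
correlation between scores and ranks in a sample of n is taken to be its
limiting value multiplied by sqrt((n - 1)/(n + 1)), with limiting values
of 1 for the uniform and sqrt(3/pi) for the normal distribution. A minimal
Python sketch under that assumption (the function and constant names are
hypothetical):

```python
import math

def rank_correlation(n, limit):
    """Assumed relation: correlation between scores and their ranks
    in a sample of n, given the large-sample limiting correlation."""
    return limit * math.sqrt((n - 1) / (n + 1))

UNIFORM_LIMIT = 1.0                      # assumed limit, uniform population
NORMAL_LIMIT = math.sqrt(3 / math.pi)    # assumed limit, normal population

for name, limit in (("normal", NORMAL_LIMIT), ("uniform", UNIFORM_LIMIT)):
    # Prints the correlation at n = 25 and the limiting value, to 2 places.
    print(name, round(rank_correlation(25, limit), 2), round(limit, 2))
```

Under the assumed relation the sketch gives .94 and .96 at n = 25 and
limits of .98 and 1.00, matching the figures attributed to Stuart above.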


3.  Distribution-Free  vs  Classical  Tests 


Both distribution-free and classical tests have points of superiority,
and which type of test should be used depends upon a number of
specific conditions as well as upon the sophistication of the user. The
comparison, however, is generally quite favorable to distribution-free
tests. Some advantages and disadvantages of distribution-free relative
to parametric tests are outlined in the paragraphs to follow.

a. Simplicity of Derivation. Most distribution-free tests can
be derived using simple combinatorial formulae, while the derivation
of classical tests requires a level of mathematics far above the highest
level attained by the typical research worker. However, the logic
and appropriateness of a test's application, the assumptions it makes,
and its sensitivity to assumption violation all hinge upon its derivation.
If the research worker understands the derivation, he can deduce or
infer much of this necessary information for almost any application he
may contemplate, thus operating with a maximum of comprehension and
flexibility. If he does not understand it, he is reduced to the uncomprehending
"cookbook" procedures of performing tests by following
a paradigm while obeying certain highly overgeneralized rules of thumb.
In the opinion of the writer this simplicity of derivation is by far the
most important advantage of distribution-free statistics since, for
research workers ignorant of higher mathematics, it replaces a
mystery-cloaked ritual with a truly scientific procedure.

b. Ease of Application. The mathematical operations required
in computing the test statistic are generally much less involved
for distribution-free than for parametric statistics. Frequently all
that is required is counting, or adding, subtracting and ranking. This
simplicity of application is obviously an economic advantage, permitting
lower-paid, mathematically naive personnel to be employed to reduce
data and perform computations.

c. Speed of Application. When samples are of small or
moderate size, distribution-free methods are generally faster than
parametric techniques. This saving in computation time may be
used to obtain more data, thus frequently cancelling any advantage
the parametric test may have in terms of statistical efficiency. When
samples are large (say N > 30) distribution-free tests involving
simple counting are generally faster, while those involving ranking
may prove considerably more time consuming, than standard classical
tests. And if a large number of similar tests are to be performed
using an electronic computer, rather than a desk calculator, parametric
tests are probably faster at all sample sizes.

d. Statistical Efficiency. As indicated in the preceding paragraphs,
when judged by the practical criterion of the total amount of
human effort required to conduct an experiment and analyze its results,
distribution-free tests are frequently, if not generally, more efficient
than their parametric counterparts. When judged by the mathematical
criterion of statistical efficiency, distribution-free tests are often
superior or equal to their most efficient parametric counterparts when
both tests are applied under "nonparametric" conditions, i.e., conditions
meeting all assumptions of the distribution-free test, but failing
to meet some of the assumptions of the parametric test. When both
tests are applied under "parametric" conditions, i.e., conditions
meeting all assumptions of the parametric test, and therefore of both
tests, distribution-free tests are very slightly less efficient (i.e.,
have relative efficiencies a shade less than 1.00) at extremely small
sample sizes, becoming increasingly less efficient as sample size
increases. When sample size becomes infinite, distribution-free
tests generally have their lowest efficiencies relative to the most
efficient, comparable parametric test. This efficiency value may be
as high as .955 or as low as zero, depending on the test.
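
The upper figure of .955 agrees with 3/pi, the large-sample efficiency of
rank-sum tests of the Wilcoxon type relative to the t test under normality.
That identification, and the sign-test value 2/pi used for contrast, are
supplied here as assumptions and are not named in the passage above. The
sketch also assumes the usual reading of relative efficiency as a ratio of
sample sizes: a test of efficiency e needs roughly n/e observations to
match a parametric test based on n. All names are hypothetical.

```python
import math

ARE_RANK_SUM = 3 / math.pi   # assumed origin of the ".955" figure
ARE_SIGN = 2 / math.pi       # assumed sign-test value, shown for contrast

def equivalent_sample_size(n_parametric, efficiency):
    """Observations the less efficient test needs in order to match
    a parametric test based on n_parametric observations."""
    return math.ceil(n_parametric / efficiency)

for label, are in (("rank-sum", ARE_RANK_SUM), ("sign", ARE_SIGN)):
    # Efficiency to 3 places, then the sample size equivalent to N = 100.
    print(label, round(are, 3), equivalent_sample_size(100, are))
```

On these assumptions a rank-sum test needs about 105 observations, and a
sign test about 158, to match a t test on 100 normal observations.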

e.  Scope  of  Application.  Because  they  are  based  on  fewer 
and  less  elaborate  assumptions  than  classical  tests,  distribution-free 
tests  can  be  legitimately  applied  to  a  much  larger  class  of  populations. 


f. Susceptibility to Violation of Assumptions. Obviously the
more elaborate the assumptions the fewer the number of situations which
meet them, and, in this sense, parametric assumptions are the more
susceptible to violation. For example, the parametric assumption of
normality requires that, in addition to being continuously and symmetrically
distributed (as might be assumed by nonparametric tests), the
population must also be bell-shaped, since these are all features of
a Gaussian distribution.

g. Detectability of Violations of Assumptions. When the nonparametric
assumption of continuous distributions is violated, both the
fact and the degree of the violation are readily apparent from the existence
of tied scores in the obtained data. No such obvious indication
advises the experimenter that a parametric assumption has been violated.
Of course he may apply tests for normality or homogeneity to
the obtained data, but such tests are rather unsatisfactory. They
are unlikely to detect any but the most extreme violations when samples
are small, and they are almost certain to detect even the most trivially
slight violations when samples are very large.

h. Effect of Assumption Violations.* Although much has been
written about the robustness of classical tests and their insensitivity to
violation of assumptions, this claim actually rests upon a multitude of
qualifications which rarely accompany it. The writer has obtained
completely natural and uncontrived experimental data which, by violating
a single parametric assumption, rendered a standard parametric
test completely powerless, at reasonable sample sizes and standard
significance levels, to reject any of a wide range of false hypotheses.
The fact is that any violation of assumptions can be expected to alter
the distribution of the test statistic and change the value at which the
test statistic becomes significant. Whether or not this effect is
negligible depends not only upon the degree to which the assumption
is violated but also upon extrinsic factors such as sample size and
significance level. This is true of both parametric and distribution-free
tests.

In the nonparametric case, the effects of violation of the continuity
assumption can be mitigated by applying certain methods of
dealing with tied scores; in the parametric case, the effect of nonnormality
can be reduced by use of transformations, but at considerably
greater expenditure of time.

* This topic is discussed at length in two WADC Technical Reports
shortly to go to press: Bradley, J. V., Studies in research methodology.
I: Compatibility of psychological measurements with parametric
assumptions, and Bradley, J. V., Studies in research methodology.
II: Consequences of violating parametric assumptions - fact and
fallacy.

i. Type of Measurements Required. Measurements on an
interval or ratio scale are generally required by classical tests. However,
distribution-free tests have greater versatility. They generally
require measurements on at least an ordinal, or sometimes a nominal,
scale but can be used with measurements from any higher order scale.
They are, of course, the only truly appropriate tests when original
scores exist in the natural form of ranks or small frequencies.

j. Logical Validity of Rejection Region. The distribution of
a classical test statistic is usually continuous, increasing or decreasing
smoothly, without fluctuation, except for a possible change of direction
at a single mode. Unfortunately the point probability of a nonparametric
test statistic does not necessarily always increase as the test statistic
approaches its most probable value. It may level off or even dip before
resuming its climb. This characteristic, when it exists, may be decidedly
embarrassing when the rejection region for a distribution-free test
is selected on an intuitive basis. Should the rejection region be chosen
as the cumulative probability for those values of the test statistic which
are least likely, or those which are most distant from the expected
value of the test statistic?
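
A small enumeration makes the difficulty concrete. The example chosen
here is an assumption for illustration and is not taken from the passage
above: the total number of runs in a random arrangement of 3 objects of
one kind and 10 of another. The Python sketch below counts the equally
likely arrangements giving each number of runs, then orders the possible
values by the two criteria just mentioned. The function name is
hypothetical.

```python
from collections import Counter
from itertools import combinations

def runs_distribution(n_a, n_b):
    """Count the arrangements of n_a A's and n_b B's that yield
    each possible total number of runs."""
    n = n_a + n_b
    counts = Counter()
    for a_positions in combinations(range(n), n_a):
        a_set = set(a_positions)
        seq = [i in a_set for i in range(n)]
        # A new run starts wherever adjacent elements differ.
        runs = 1 + sum(seq[i] != seq[i - 1] for i in range(1, n))
        counts[runs] += 1
    return dict(sorted(counts.items()))

dist = runs_distribution(3, 10)
total = sum(dist.values())
mean = sum(r * c for r, c in dist.items()) / total

# Criterion 1: least likely values first.
by_probability = sorted(dist, key=lambda r: dist[r])
# Criterion 2: values most distant from the expected value first.
by_distance = sorted(dist, key=lambda r: -abs(r - mean))

print(dist)
print(by_probability)
print(by_distance)
```

Of the 286 arrangements, 81 give five runs, 72 give six and 84 give
seven, so the point probability dips at six runs before rising to its
maximum. The two orderings consequently disagree: the first admits six
runs to the rejection region before seven, the second admits seven
before six.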

k. Types of Statistics Testable. Statistics defined in terms
of arithmetical operations upon observation magnitudes can be tested
by classical techniques, while those defined by order relationships (rank)
or category-frequencies can be tested by distribution-free methods.
Means and variances are examples of the former, medians and exceedances
of the latter. The two approaches are different, but neither is
superior; both types of statistic have their advantages.

l.  Testability  of  Higher  Order  Interactions.  Higher  order
interactions  can  be  tested  with  ease  by  classical  methods.  However, 
there  are  few  distribution-free  tests  for  higher  interactions  and  they 
are  awkward  and  limited  in  application. 

m. Choice of Significance Level. The distribution of the
test statistic, when the null hypothesis is true, is usually continuous
for classical tests and discrete for distribution-free tests. This means
that, for any designated significance level α, a value of the classical
statistic can be found whose cumulative probability is exactly α while,
for the distribution-free test, such a value of the test statistic usually
does not exist. Thus when using a classical test the research worker
may choose any significance level he wishes, while, when using a distribution-free
test, he must either accept one of the discrete cumulative
probabilities of the test statistic as his significance level, or he must
apply the test inexactly, using as significance level a cumulative probability
which the test statistic cannot actually assume and rejecting
whenever it is found to have a smaller cumulative probability. The
latter choice is often forced upon him by inexact tables of probabilities
which list values of the test statistic which are "significant" at the
standard significance levels, .05, .01 and .001.
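
The discreteness is easily exhibited for the Sign test, whose statistic
under the null hypothesis is binomial with p = 1/2. The following Python
sketch, with N = 10 chosen arbitrarily and with hypothetical function
names, lists the one-tailed cumulative probabilities the statistic can
actually assume and finds the largest one not exceeding a designated
level.

```python
from math import comb

def lower_tail_levels(n):
    """Cumulative probabilities P(X <= k), k = 0..n, for the number of
    minus signs X when each sign is equally likely to be plus or minus."""
    total = 2 ** n
    levels = []
    running = 0
    for k in range(n + 1):
        running += comb(n, k)
        levels.append(running / total)
    return levels

def largest_level_not_exceeding(n, alpha):
    """Critical value k and attainable level P(X <= k) closest to,
    but not above, alpha; None if no attainable level is that small."""
    best = None
    for k, level in enumerate(lower_tail_levels(n)):
        if level <= alpha:
            best = (k, level)
    return best

levels = lower_tail_levels(10)
for k in range(4):
    print(k, round(levels[k], 4))
print(largest_level_not_exceeding(10, 0.05))
```

For N = 10 the attainable one-tailed levels jump from 11/1024 (about
.011) to 56/1024 (about .055), so no rejection region has level exactly
.05. A worker who nominally uses .05 is in fact testing at about .011.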

n. Influence of Sample Size. The size of the sample upon
which they are to be used is an extremely important factor in determining
the relative merits of distribution-free and classical tests.
When samples are small (say N ≤ 10) distribution-free tests are easier,
quicker and only slightly less efficient even if all assumptions of the
parametric test have been met. At these sample sizes, violations
of parametric assumptions generally have their most devastating effect,
yet are most unlikely to be detected. Therefore, unless the experimenter
has a priori knowledge that all parametric assumptions have been
met, the wiser choice would generally appear to be a distribution-free
test. When samples are large (say N > 30), some distribution-free
tests still compare favorably with their parametric counterparts. Others,
however, will have become more laborious and time consuming, and, in
contrast to parametric tests whose assumptions are met, their calculated
or tabled probabilities may be only approximate. Finally, their
efficiency relative to a parametric test whose assumptions are all true
may have dropped to an appreciably lower level. On the other hand,
appreciable violations of parametric assumptions will have become
more readily detectable and, in many cases, their effect may have
become negligible due to the effect described by the central limit
theorem. At large sample sizes, therefore, either type of test may
be superior; however, circumstances are much more favorable to
parametric tests than is the case when samples are small.


4.  Organization  of  Material 

Certain topics appear to be of critical importance to the understanding
and application of distribution-free tests. These topics will be
discussed in a general way in the following paragraphs and the same topics
will form the paragraph headings under which each of the more important
distribution-free tests will be examined.

a. Rationale. The best insurance against misapplication is a
thorough understanding of the derivation and the mathematical logic
upon which a test is based. The hypothesis which can be tested, the
assumptions which must be made, the seriousness of various degrees
of assumption-violation, the best method of dealing with such violations,
the efficiency of the test, the situations to which it is applicable and the
exactitude of the tables or of the probabilities obtained by formula all
depend upon the test's derivation and can either be directly determined
or partially inferred from a knowledge of it. Furthermore, many tests
are legitimately applicable in situations for which they were not originally
designed; however, the experimenter will not be able to recognize these
situations unless he understands the derivation. Because of their importance,
therefore, derivations have been given at some length. An effort
has been made to use the simplest mathematics possible and to present
derivations which will give the greatest insight into the logic of application
and the advantages and limitations of the test. For this reason,
many of the derivations are mathematically inefficient and are not in
the form in which they are found in the literature.

b. Null Hypothesis. The literature on a test frequently does
not contain an explicit and precise statement of the tested hypothesis.
Instead the hypothesis may be implicit in some mathematical manipulations,
it may be vaguely hinted at, or it may be stated explicitly but
inaccurately, generally in the direction of overstatement. A major
reason for these difficulties appears to be the lack of concise verbal
terms to express what the test is actually doing. In order to avoid
misleading the reader, an attempt has been made to express the tested
hypothesis explicitly and precisely, with resort to expression in mathematical
terms when necessary.

c. Assumptions. Assumptions also are frequently unstated,
and occasionally misstated, in the literature, in which case they must
be inferred from the derivation. In common with parametric tests,
the assumptions of random sampling and independent observations are
usually required. These assumptions, however, refer, at least in a
sense, not to characteristics of the sampled population but rather to
the method of sampling. Unlike "population" assumptions, their validity
can generally be assured by adhering rigidly to certain prescribed
sampling and experimental procedures.

Aside from the above, one of the commonest nonparametric
assumptions is that the sampled populations are continuously distributed.
Such a population has an infinite number of abscissae and thus
contains an infinite number of different score magnitudes, each of which
has zero a priori probability of being drawn. Theoretically, therefore,
a sample from a continuously distributed population will contain no scores
of zero and no tied scores since zero is a predesignated score and since
the first-drawn member of a tied group can be considered to predesignate
the remainder. Zero scores are embarrassing in tests using the algebraic
sign of scores, and tied scores are undesirable in tests which rank
scores and whose derivation requires that each rank occur only once.
The assumption of continuity, however, is an unrealistic one. Even if
the sampled population is continuous, measurements made upon its
members must be discretely distributed since no measuring instrument is
capable of infinite precision. Suppose any population of actual measurements
to be transformed into measurements on a scale running from
zero to one and that precision is possible out to the N-th decimal place.
Then the population of measurements is a discrete population whose
interval width is the difference between successive digits at the N-th
decimal place. The assumption of continuous distributions, therefore,
can never be exactly fulfilled in practice. It can be approximated by
taking fine measurements from distributions representing a very large
number of distinguishable values. Fortunately, the degree to which
the continuity assumption is violated can be largely inferred from the
proportion of tied scores in the data. Therefore, although unrealistic,
this assumption has the advantage that its violations are highly detectable.
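
The diagnostic just described can be sketched in a few lines of Python.
The scores below are hypothetical, and the definition of the proportion
of tied scores used here (the fraction of scores sharing their value
with at least one other score) is one reasonable reading, supplied as an
assumption. The same scores are examined as recorded to four decimal
places and as they would appear on an instrument reading to two.

```python
from collections import Counter

def tied_proportion(scores):
    """Fraction of scores that share their value with at least one
    other score in the sample."""
    counts = Counter(scores)
    tied = sum(c for c in counts.values() if c > 1)
    return tied / len(scores)

fine = [0.1234, 0.1249, 0.1301, 0.5678, 0.5712]   # hypothetical measurements
coarse = [round(x, 2) for x in fine]              # same scores, cruder scale

print(tied_proportion(fine))
print(tied_proportion(coarse))
```

With the finer scale no scores are tied; with the cruder scale four of
the five are, signalling a marked departure from the continuity
assumption.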
Another assumption frequently encountered is that the sampled
populations have identical, but unspecified, shapes. This assumption
is found in tests which fail to reject when the sampled populations are
identical but which may reject for a variety of reasons. If identical
shapes are assumed, rejection may be attributed to nonidentity of location.
It is to be noted that this assumption may be dispensed with if the test
be regarded merely as a test for identical populations against the broad
alternative of nonidentical populations.

d. Treatment of Zero or Tied Scores. As mentioned earlier,
some tests require that all scores have an algebraic sign, i.e., that
there are no scores of zero magnitude; others require that no scores
have the same magnitude, i.e., that there are no ties for any given rank.
Zero and tied scores do sometimes occur, however, and several methods
of dealing with them have been suggested:

(1) Randomize. Randomly assign a plus or a minus to
each zero score (say, on the basis of a coin toss); or randomly assign
to scores of the same magnitude the ranks they would have if not tied,
i.e., if differing very slightly. This method appeals to mathematicians,
because only under this method does the test statistic have exactly the
same distribution, when the null hypothesis is true, that it would have
if the continuity assumption were not violated. It makes little sense
experimentally, however, since it permits an additional and, in a
sense, unnecessary, element of pure chance to help determine whether
or not a false hypothesis will be rejected.

(2) Minimize the Probability of Rejection. Assign all
zero scores that algebraic sign which is least conducive to rejection
of the null hypothesis; or assign ranks to tied scores in the way least
conducive to rejection of the null hypothesis. This is the conservative
approach and it alone insures, in advance of sampling, that the tested
hypothesis will not be falsely rejected due to violation of the assumption
of continuity.

(3) Obtain the Average Value of the Test Statistic. Assign
half the zeros a plus, half a minus sign; or assign each score in the tied
group the average of the ranks the members of the group would have if
not tied. The latter is known as the midrank method. It results in a
distribution of ranks having the same mean but somewhat smaller
variance than the discrete rectangular distribution of integers 1 to N.
For some tests a "correction for ties" has been devised for use with
the midrank method. When applied to asymptotic formulae for the
test statistic the correction compensates for the reduction in variance
due to the use of midranks. It thus tends to reestablish the validity
of the test in the large-sample case. The logic of the implicit assumptions
upon which this correction is based has been challenged (VII-36).
However, the correction is probably an improvement in any case,
although perhaps not fully restoring the test to exactitude.

(4)  Obtain  the  Average  Probability.  Break  ties  in  all 
possible  ways,  calculate  the  test  statistic  and  obtain  its  probability  for 
each  way,  and  average  these  probabilities.  This  improves  on  the  above 
method  by  obtaining  the  average  probability  of  the  test  statistic,  rather 
than  the  probability  for  the  average  value  of  the  test  statistic,  averaging 
over  all  possible  ways  in  which  tied  measurements  could  have  been 
caused  by  truly  differing  scores.  It  is  time  consuming,  however,  and 
has  the  disadvantage,  in  common  with  the  preceding  method,  that  the 
average  of  all  possibilities  may  differ  greatly  from  that  one  possibility 
which  represents  the  true  state  of  affairs. 

(5)  Drop  Zeros.  Discard  zero  scores  and  reduce  N 
accordingly.  The  power  of  certain  tests  has  been  found  to  be  greater 
under  this  method  than  under  methods  (1)  or  (3).  However,  it  seems 
likely  that  this  is  an  artifact  attributable  to  an  unrecognized  and  spurious 
increase in the probability of rejection in all cases, i.e., when the
tested hypothesis is true as well as when it is false. Zero difference
scores lend support to the hypothesis of "no difference." Discarding
them eliminates data favoring the null hypothesis and permits contrary
data  to  assume  greater  weight,  thus  spuriously  increasing  the  probability 
of  rejection. 
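The midrank method described under (3) can be sketched in a few lines of Python. The fragment is only an illustration: the function name midranks and the sample scores are hypothetical and come from no source cited here. It gives each tied group the average of the ranks the group occupies and then checks the property stated above, namely that the midranks keep the mean of the integers 1 to N but have a somewhat smaller variance.

```python
from statistics import mean, pvariance

def midranks(scores):
    """Return ranks 1..N, giving each tied group the average of its ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Extend 'end' over every score tied with the one at 'start'.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) hold ranks start+1..end+1.
        average_rank = (start + 1 + end + 1) / 2
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks

scores = [3.1, 4.0, 4.0, 5.2, 4.0, 6.7]    # hypothetical data with a three-way tie
ranks = midranks(scores)
untied = list(range(1, len(scores) + 1))   # the integers 1 to N

print(ranks)                               # [1.0, 3.0, 3.0, 5.0, 3.0, 6.0]
print(mean(ranks), mean(untied))           # both equal (N + 1) / 2
print(pvariance(ranks), pvariance(untied)) # the first is the smaller
```

The three tied scores share the ranks 2, 3 and 4 and so each receives the midrank 3.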


A  final  method  is  to  calculate  the  test  statistic  twice, 
once  giving  all  ambiguous  data  (zero  or  tied  scores)  the  possible  true 
values  which  are  most  conducive  to  rejection,  once  giving  them  the 
values least conducive to rejection. It has been said, with some
justification, that if in both cases the test statistic falls within, or in both
cases  outside  of,  the  rejection  region  there  is  no  problem;  if  it  does 
not,  there  is  no  solution. 
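This last procedure is easily illustrated for the sign test. The sketch below assumes a one-tailed test in which many plus signs lead to rejection; the function names, the counts of 9 plus, 1 minus and 2 zero scores, and the significance level of .05 are all hypothetical choices made for the illustration.

```python
from math import comb

def upper_tail_probability(n_plus, n):
    """P(at least n_plus plus signs among n) when plus and minus are equally likely."""
    return sum(comb(n, k) for k in range(n_plus, n + 1)) / 2 ** n

def sign_test_with_zeros(n_plus, n_minus, n_zero, alpha=0.05):
    """One-tailed sign test computed twice, with the zeros resolved both ways."""
    n = n_plus + n_minus + n_zero
    # Most conducive to rejection: every zero is counted as a plus.
    p_most = upper_tail_probability(n_plus + n_zero, n)
    # Least conducive to rejection: every zero is counted as a minus.
    p_least = upper_tail_probability(n_plus, n)
    if p_least <= alpha:
        verdict = "reject either way"
    elif p_most > alpha:
        verdict = "do not reject either way"
    else:
        verdict = "ambiguous: the zeros decide the outcome"
    return p_most, p_least, verdict

p_most, p_least, verdict = sign_test_with_zeros(n_plus=9, n_minus=1, n_zero=2)
print(p_most, p_least, verdict)
```

With these counts the two probabilities are 13/4096 and 299/4096, which fall on opposite sides of .05, so the example is one of those for which "there is no solution."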

e. Efficiency. Certain mathematical properties of a test
are important in evaluating its usefulness. The power of a test is
the probability of its rejecting a specified false hypothesis. (It is
equal to 1 - β, where β is the probability of committing a Type II
error, i.e., failing to reject a false null hypothesis.) Power, then, depends
upon at least four variables: (a) the amount by which the hypothesis is in
error, i.e., the size of the discrepancy, δ, between the hypothesized
and true condition, (b) the size, α, of the significance level chosen,
(c) the location of the rejection region, e.g., whether the test is
one-tailed or two-tailed, (d) the size, N, of the sample used in the test.

A power function is a curve in which all but one of these variables are
held constant and power is plotted as ordinate against that one variable,
usually δ, as abscissa. A test of a given true hypothesis is most
powerful against a specified alternative hypothesis if no other test of
the same hypothesis has greater power against the same alternative.
If it is most powerful with respect to each member of a class of
alternative hypotheses, the test is called uniformly most powerful against
that class of alternatives.

A test is unbiased, for a given alternative, if the probability
of  rejecting  the  null  hypothesis  is  greater  when  the  alternative  hypothesis 
is  true  than  when  the  null  hypothesis  is  true. 

A test is consistent for a given alternative to the null hypothesis
if, when that alternative hypothesis is true, the probability of rejecting
the false null hypothesis, i.e., the power of the test, approaches
1 as the sample size, N, on which the test is based, approaches infinity.
The  test  is  consistent  with  respect  to  a  class  of  alternatives  if  it  is 
consistent  for  each  of  the  alternatives  of  which  the  class  is  composed. 
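The notions of power function, unbiasedness and consistency defined above can be made concrete with the one-tailed sign test, whose power is an exact binomial sum. The following Python sketch is an illustration only; the function names are hypothetical, the discrepancy is expressed as the probability p of a plus sign (p = 1/2 under the null hypothesis), and the nonrandomized test is used, so that the actual size may fall below the nominal level.

```python
from math import comb

def binomial_upper_tail(c, n, p):
    """P(S >= c) for S distributed as Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c, n + 1))

def critical_value(n, alpha):
    """Smallest c with P(S >= c | p = 1/2) <= alpha; n + 1 if no such c exists."""
    for c in range(n + 1):
        if binomial_upper_tail(c, n, 0.5) <= alpha:
            return c
    return n + 1

def power(n, p, alpha=0.05):
    """Power of the one-tailed sign test when the probability of a plus is p."""
    return binomial_upper_tail(critical_value(n, alpha), n, p)

# Power function: N and alpha are held constant and power varies with p.
for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"N = 20, p = {p:.1f}, power = {power(20, p):.4f}")

# Consistency: against the fixed alternative p = 0.7 the power climbs toward 1.
for n in (10, 20, 50, 100, 200):
    print(f"p = 0.7, N = {n:3d}, power = {power(n, 0.7):.4f}")
```

At p = 1/2 the printed "power" is simply the size of the test; its being smaller than the power at every p above 1/2 is what unbiasedness against those alternatives requires.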

Efficiency  is  a  relative  term  comparing  the  sensitivity  of  a 
test  with  that  of  some  other  test,  usually  the  most  powerful  alternative 
available.  Let  A  and  B  be  statistical  tests  of  the  same  null  hypothesis 
against  the  same  set  of  alternative  hypotheses,  and  let  the  tests  use 
the  same  significance  level  and  the  same  number  of  tails.  Then  the 
efficiency  of  test  A  relative  to  test  B  can  be  interpreted  as  the  ratio 
b/a,  where  a  is  the  number  of  observations  required  by  test  A  to  equal, 
by  some  criterion,  the  power  of  test  B  based  on  b  observations.  There 
are  actually  a  number  of  definitions  of  efficiency,  differing  mainly  in 
the  criterion  by  which  the  two  powers  are  equated. 

Asymptotic  efficiency  is  usually  defined  in  terms  of  the  limiting 
value  of  the  ratio  b/a  as  b  approaches  infinity  and  is  therefore  relevant 
only  when  the  test  is  to  be  applied  to  very  large  samples.  It  has  the 
advantage  of  being  very  nearly  independent  of  the  exact  size  of  the 
samples  so  long  as  they  are  very  large.  The  more  common  definitions 
of asymptotic efficiency appear to be equivalent. Asymptotic relative
efficiency, abbreviated A.R.E., and sometimes called Pitman efficiency,
is defined roughly as follows. Let A and B be two consistent
tests based upon a and b observations respectively, each test statistic
being asymptotically normally distributed. Let both A and B test a
null hypothesis H₀ against an alternative hypothesis H₁ at a significance
level α. The asymptotic relative efficiency of A with respect
to B is the limiting value of the ratio b/a as a is allowed to vary in
such a way as to give A the same power as B while, simultaneously,
b approaches infinity and H₁ approaches H₀. The purpose of the
"approach" of H₁ to H₀ is to prevent the ratio b/a from assuming a
limiting value of 1 which it otherwise would do since at extremely large
sample sizes the power of a consistent test against a fixed alternative
is virtually 1. The method of obtaining asymptotic relative efficiency
has been shown to be equivalent (Stuart V-50) to that of obtaining
asymptotic local efficiency. Let A and B be one-tailed tests based on
a and b observations respectively and testing the same null hypothesis
against the same set of alternative hypotheses at the same significance
level. Let b approach infinity and vary a so that the power functions
of the two tests have equal slopes at the point δ = 0. Then the limiting
ratio b/a is the asymptotic local efficiency of test A relative to test B.
Somewhat similar methods involve taking the asymptotic ratio of first
derivatives, i.e., slopes, of the power functions at the point δ = 0. In the
case of equal-tailed, two-tailed tests this is zero and the asymptotic
ratio of second derivatives is used. Estimate efficiency is obtained by
establishing a mathematical equivalence between the relative efficiency of
two tests and the relative efficiency of two estimators of a population
parameter. The latter requires that both estimates be consistent and
asymptotically normally distributed and is expressed in terms of the
ratio of the asymptotic variances of the two estimators. Estimate
efficiency is therefore an index of relative efficiency for the case where
both tests are based upon large, i.e., "infinite", samples. Stuart (VI-26)
observes that estimate efficiency is equivalent to asymptotic relative
efficiency. All of the asymptotic efficiencies defined above refer to
the relative power of two tests at the point δ = 0 of their power functions.
The  efficiency  values  obtained  therefore  represent  the  effectiveness 
of  one  test  relative  to  another  when  the  true  condition  differs  negligibly 
from  the  hypothesized  condition,  i.  e.  ,  when  the  alternative  hypothesis 
lies  in  the  immediate  vicinity  of  the  null  hypothesis. 
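Two of the asymptotic efficiencies listed in Table I can be checked numerically. The sketch below rests on closed forms that are not derived in this report and are introduced here as assumptions: for a shift in location of a population with density f and variance σ², the asymptotic efficiency relative to Student's t is taken to be 4σ²f(0)² for the sign test and 12σ²(∫f²)² for the Mann-Whitney test. The function names are hypothetical and the population is taken to be normal.

```python
from math import exp, pi, sqrt

def normal_density(x, sigma=1.0):
    """Density at x of a normal distribution with zero mean and standard deviation sigma."""
    return exp(-x * x / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi))

def integral_of_squared_density(sigma=1.0, half_width=10.0, steps=200_000):
    """Trapezoidal estimate of the integral of f(x)^2 over +/- half_width sigmas."""
    a = -half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.5 * (normal_density(a, sigma) ** 2 + normal_density(-a, sigma) ** 2)
    for i in range(1, steps):
        total += normal_density(a + i * h, sigma) ** 2
    return total * h

sigma = 1.0
# Sign test relative to t: 4 * sigma^2 * f(0)^2, which reduces to 2/pi.
are_sign = 4 * sigma ** 2 * normal_density(0.0, sigma) ** 2
# Mann-Whitney relative to t: 12 * sigma^2 * (integral of f^2)^2, which reduces to 3/pi.
are_mann_whitney = 12 * sigma ** 2 * integral_of_squared_density(sigma) ** 2

print(round(are_sign, 3), round(are_mann_whitney, 3))   # 0.637 0.955
```

Both values refer, as the text notes, to alternatives in the immediate vicinity of the null hypothesis and say nothing directly about efficiency at small sample sizes.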

Nonasymptotic  efficiencies  depend  upon  the  size  sample  upon 
which  the  test  is  based,  upon  the  location  of  the  rejection  region,  upon 
the size α of the significance level chosen, and upon the alternative
hypothesis or set of alternative hypotheses. Balancing the disadvantage
that nonasymptotic efficiencies are highly specific to experimental
test conditions is the advantage that they are quite realistic for
those conditions. While asymptotic efficiencies provide a limiting
value for a test's efficiency at infinite sample size, this value is
generally much lower, when distribution-free statistics are compared
with classical tests, than is the efficiency value at practical sample
sizes. The relative efficiency of A with respect to B is simply b/a
where a is the number of observations required by test A to equal the
power of test B based on b observations when both statistics test the
same null hypothesis against the same alternative at the same
significance level (both either one-tailed or two-tailed). The power
efficiency of test A with respect to test B (of the same null hypothesis
at the same significance level against the same set of alternative
hypotheses) is obtained by holding a constant and varying b until the
power functions of the two tests are equated in the sense that the area
between the power functions when the ordinate for test A exceeds that
of test B equals the area between the power functions when the reverse
is true. The value taken by b need not be integral. The power
efficiency of A relative to B is then b/a. This definition of efficiency has
the advantage that the obtained efficiency values are peculiar to an
entire class of alternative hypotheses rather than to a specific
alternative hypothesis. Its disadvantage lies in the failure of statisticians
to agree completely upon the precise method by which to apply it.
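The relative efficiency b/a defined above can be computed directly in simple cases. The following Python sketch is an illustration under several assumptions that are not part of this report: the population is normal with unit variance, the classical comparison is the one-tailed z-test with known variance (chosen because its power has a closed form), and the sign test is randomized on its boundary count so that both tests have size exactly α. All function names and the values b = 20 and δ = 0.5 are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_power(b, delta, alpha):
    """Power of the one-tailed z-test (unit variance known) with b observations."""
    return STD_NORMAL.cdf(delta * sqrt(b) - STD_NORMAL.inv_cdf(1 - alpha))

def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def sign_test_power(a, delta, alpha):
    """Power of the one-tailed sign test with a observations, randomized on the
    boundary count so that its size is exactly alpha."""
    # Walk down from the largest count, accumulating null probability.
    tail = 0.0
    for c in range(a, -1, -1):
        mass = binomial_pmf(c, a, 0.5)
        if tail + mass > alpha:
            gamma = (alpha - tail) / mass   # rejection probability when S == c
            break
        tail += mass
    p = STD_NORMAL.cdf(delta)               # P(observation > 0) under the shift
    certain = sum(binomial_pmf(k, a, p) for k in range(c + 1, a + 1))
    return certain + gamma * binomial_pmf(c, a, p)

def relative_efficiency(b, delta, alpha=0.05, max_ratio=10):
    """b/a, where a is the smallest sample size at which the sign test's power
    is at least that of the z-test based on b observations."""
    target = z_test_power(b, delta, alpha)
    for a in range(b, max_ratio * b + 1):
        if sign_test_power(a, delta, alpha) >= target:
            return b / a
    return None   # power not matched within the search limit

efficiency = relative_efficiency(b=20, delta=0.5)
print(efficiency)
```

Because a is restricted to whole numbers, the ratio obtained in this way moves in small jumps as b or δ is changed; this is one reason why definitions of efficiency differ in the criterion by which the two powers are equated.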

Some  asymptotic  efficiencies  of  some  distribution-free  tests 
relative  to  their  classical  counterparts  are  given  in  Table  I.  All 
efficiencies  given  in  the  body  of  the  table  are  for  the  case  where  both 
tests  are  applied  under  conditions  satisfying  all  of  the  assumptions 
of  the  classical  test.  Except  when  otherwise  specified,  the  tests 
were applied to normally distributed populations; comparisons
involving Student's t required that the two populations to which both
tests were applied have equal variances, etc. When more than one
efficiency is listed in a cell, the asymptotic efficiency of the test
depends upon the number of categories or groups to which the test is
applied. An asymptotic efficiency of zero requires some interpretation.
It means that, when both tests are based upon an equal and "infinite"
number of observations, the test with zero asymptotic efficiency
requires "infinitely" more observations in order to equal the power of
the comparison test. It does not mean that the ratio of the powers of
the two tests is zero or infinity. The power of any consistent test
approaches 1 as sample size approaches infinity. Therefore when a
consistent test has an asymptotic efficiency of zero both its power and
the power of the comparison test are very close to 1 and are approaching
1 as sample size approaches infinity. The power of the comparison
test, however, is approaching 1 faster. That is, at any "infinite", i.e.,
extremely large, sample size the power of the comparison statistic is
very slightly greater than that of the test whose efficiency is sought,
but "infinitely", i.e., very many, more observations are required by
the test with zero asymptotic efficiency to close this infinitesimal
power gap. Finally, tests with zero asymptotic efficiency with respect
to the same comparison test do not necessarily have equal asymptotic
efficiency with respect to one another. For example, each of the four
tests in Table I having zero asymptotic efficiency with regard to the
regression coefficient has zero asymptotic efficiency with respect to
all of the seven to ten tests listed above it.


TABLE I

EFFICIENCIES OF SOME DISTRIBUTION-FREE TESTS RELATIVE TO, AND UNDER
CONDITIONS ASSUMED BY, A (MOST POWERFUL) CLASSICAL COMPARISON STATISTIC*

Test                        Asymptotic   Established by                            Footnotes
                            Efficiency

Student's t*                1.000
X-test                      1.000        van der Waerden
Mann-Whitney                .955         Pitman, Mood, Dwass, van der Waerden      C, U, 1
Sign                        .637         Cochran, Jeeves & Richards, Dixon, Walsh  C
Westenberg Median           .637         Mood
No. Runs (Location)         0            Pitman, Mood                              C

Analysis of Variance*       1.000
Kruskal-Wallis H            .955         Andrews                                   C, 2
Friedman                    .637-.912    Friedman
k-Sample Median             .637         Andrews                                   C, 3

F-Ratio*                    1.000
Mood's Dispersion           .87          Mood, Dwass
No. Runs (Dispersion)       0            Pitman, Mood                              C

Maximum Likelihood*         1.000
[?] for Dispersion          .74          Cox & Stuart
[?] for Dispersion          .71          Cox & Stuart

Correlation Coeff.*         1.000
Kendall's τ                 .912         Moran
Spearman's ρ                .912         Hotelling & Pabst
Blomqvist's Median Test     .405         Blomqvist

Regression Coeff. b*        1.000
Mann's T                    .985         Stuart                                    C, U
Daniels                     .985         Stuart
Cox & Stuart's S[?]         .860         Stuart
Cox & Stuart's S[?]         .827         Stuart
Cox & Stuart's S            .782         Stuart
Median test for Trend       .782         Stuart
Rank Serial R[?]            0            Stuart                                    C
Records test d              0            Stuart                                    C
Difference sign             0            Stuart                                    C
Turning Point               0            Stuart

C - test has been shown to be consistent under certain conditions.
U - test has been shown to be unbiased under certain conditions.
1 - Asymptotic efficiency is 1.000 when populations have uniform distributions (Pitman).
2 - Asymptotic efficiency is 1.000 when populations have uniform distributions (Andrews).
3 - Asymptotic efficiency is .333 when populations have uniform distributions (Andrews).
[?] - symbol illegible in the source copy.


TABLE II

POWER COMPARISONS OF SOME STATISTICAL TESTS APPLIED TO THE SAME DATA

Tests in Order of             Null             Assumptions      Sample Sizes              Sig.      Author and Type
Decreasing Power              Hypothesis                                                  Level     of Comparison
(within a block)

Student's t-test              Equal means      Normal           3,3; 2,∞; 5,∞; 3,7        .05       van der Waerden
X-test                                         distributions,   3,7; 5,6                            (mathematical)
Mann-Whitney                                   equal variances  3,3; 3,7; 5,6; 1,∞; 2,∞
Max. Absolute Deviation                                         3,7; 5,6; 5,∞
Number of Runs                                                  3,7; 5,6; 5,∞

X-test                        Equal means      Uniform          4,6                       .05       van der Waerden
Mann-Whitney                                   distributions,                                       (mathematical)
Student's t-test                               equal variances

Mann-Whitney                  Equal means      Normal           5,5                       .025      Dixon
Max. Absolute Deviation                        distributions,                                       (mathematical)
Westenberg Median                              equal variances

Mann-Whitney                  Equal means      Normal           10,10                     .05       Epstein
Tsao's Max. Abs. Dev.                          distributions,                                       (empirical)
Epstein's Exceedances                          equal variances
Number of Runs

Lehmann's Most Powerful       Identical        Continuous       4,4; 6,6                  .10       Lehmann
Mann-Whitney (1-tailed)       populations      distributions                                        (mathematical)
Westenberg Median (1-tailed)  against y's
Mann-Whitney (2-tailed)       distributed as
Westenberg Median (2-tailed)  maximum x's
Max. Absolute Deviation
Number of Runs

Regression Coefficient b      Randomness       Normal           100                       .05, .01  Foster & Stuart
Mann's T-test                 against linear   distributions                                        (empirical)
Daniels                       trend
Foster & Stuart's D
Foster & Stuart's d
Rank Serial Correlation
Difference Sign
Turning Point

Number of Runs                Randomness vs.                                              .05       Bateman
Longest Run                   Markoff chain                                                         (mathematical)

A  number  of  investigators  have  compared  the  relative  powers 
of distribution-free tests with respect to each other without actually
calculating small-sample efficiencies. The tests have simply been compared
under  identical  conditions  of  application  and  then  ranked  in  order  of  power. 
Sometimes  a  most  powerful  classical  statistic  was  included.  The  results 
(see  Table  II)  of  these  comparisons  are  naturally  highly  peculiar  to  the 
conditions  under  which  the  comparison  occurred. 

Certain statisticians (17, 31, 49, 50) have addressed themselves
to the problem of determining "most powerful" distribution-free
tests. Although these efforts have succeeded, the gain in power is usually
slight and is generally obtained at the expense of simplicity. Furthermore,
the property of greatest power is contingent upon the type of distribution
assumed to exist when the null hypothesis is false. Lehmann (31) has obtained
the most powerful rank test for the hypothesis that two populations have
identical distributions against the alternative that the second population
is distributed as the k largest observations in the first population.
Terry (49) has described the rank test which is asymptotically most
powerful, at the point δ = 0, for testing the hypothesis of identical
distributions against the alternative that the two populations are normally
distributed with the same variance but with different means. His test
procedure requires that the N₁ + N₂ observations be ranked in order
of magnitude irrespective of sample. He then substitutes for each rank
the average magnitude corresponding to that rank in the average sample
of size N₁ + N₂ from a normal distribution with zero mean and unit
variance. This is accomplished by means of tables (XX and XXI)
supplied by Fisher and Yates (13). Thus scores from a population
of unknown form are, in a sense, transformed so as to represent scores
from a normal distribution. Exact tables of probabilities are available
for Terry's test for N₁ + N₂ ≤ 10, an asymptotically normally
distributed test statistic being used for large samples. A somewhat similar
test, the X-test, has been proposed by van der Waerden (50, 51). The
power of the X-test can equal that of Student's t-test when applied to
normally distributed populations (50) and can exceed the power of
the t-test when both are applied to uniformly distributed populations
(52). Both Terry's and van der Waerden's tests are analogous to,
and appear to be slightly more powerful than, the Mann-Whitney test.
Both have the dubious advantage of giving greater "weight" to extreme
observations than does the Mann-Whitney test (7). Neither, however,
can compare with the latter in simplicity or ease of application.
Furthermore, the quality of high power against "parametric", i.e., normal,
alternatives, while useful, is not an overriding consideration in
selecting a nonparametric test. It is a useful property in those cases where
populations are normal and variances homogeneous but the experimenter
does not have certain knowledge of this fact, i.e., when a
distribution-free test is necessitated by the experimenter's ignorance rather than
the population's nonnormality.
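The substitution of normal scores for ranks can be sketched briefly. The Python fragment below follows the X-test as it is commonly defined, replacing the rank R among N pooled observations by the standard normal quantile at R/(N + 1); this definition is an assumption introduced here, since the formula is not stated in this section. Terry's test differs in using the expected values of normal order statistics, which are not computed in the sketch. The function names and the data are hypothetical, and ties are assumed absent.

```python
from itertools import combinations
from statistics import NormalDist

def normal_scores(n_total):
    """Scores for ranks 1..N: the standard normal quantile at rank / (N + 1)."""
    inverse = NormalDist().inv_cdf
    return [inverse(rank / (n_total + 1)) for rank in range(1, n_total + 1)]

def x_test_lower_tail(first, second):
    """Exact one-tailed probability that the score sum of the first sample is
    as small as observed, assuming no ties and identical populations."""
    pooled = sorted(first + second)
    scores = normal_scores(len(pooled))
    score_of = {value: score for value, score in zip(pooled, scores)}
    observed = sum(score_of[value] for value in first)
    # Under the null hypothesis every choice of len(first) ranks is equally likely.
    sums = [sum(subset) for subset in combinations(scores, len(first))]
    at_most = sum(1 for s in sums if s <= observed + 1e-12)
    return observed, at_most / len(sums)

first = [1.2, 2.5, 3.1]             # hypothetical measurements
second = [4.0, 4.4, 5.8, 6.3]
statistic, probability = x_test_lower_tail(first, second)
print(statistic, probability)
```

Here the first sample holds the three lowest ranks, so only one of the 35 equally likely assignments gives a score sum as small as that observed.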

f. Application. The applicability of most tests is directly
deducible from the derivation as is the method of application.
Furthermore, many, if not all, distribution-free tests are applicable in
situations other than those for which they were originally designed, and it
would be quite impossible to anticipate all such situations and to
outline the test's method of application in each of them. Therefore, only
the briefest example will be given of the application of each
distribution-free test, and the "Application" section will often be used to illustrate
or expand upon points made in presenting the test's derivation.

g. Discussion. Tests which upon superficial examination
appear to be quite distinct may actually be identical or similar in
function, i.e., may ultimately perform the same or nearly the same
mathematical operation. In other cases, although different, they may
be mathematically interrelated to a high degree. Not infrequently the
author of a test overstates, understates or misstates the test's
capabilities. Such matters are taken up in each test's "Discussion" section.

h. Tables. For most distribution-free tests probabilities are
based upon simple combinatorial formulae. The point probability of a
given value of the test statistic is generally a fraction whose numerator
is the number of different ways (combinations) in which that value of the
test statistic can be obtained and whose denominator is the sum of
the number of different ways in which all possible values of the test
statistic can be obtained. Such tests are usually exact for small
samples whose N is small enough to permit enumeration of the
combinations constituting the numerator of the (cumulated) probability
fraction. (The denominator is usually easy to obtain.) The time
and labor involved in these computations increase drastically with
increasing N, however, so that exact tables frequently do not extend
beyond an N of very moderate size. For larger N's approximate
probabilities may generally be obtained fairly easily from asymptotic
formulae, and at this point the tables, if they continue, become inexact.
The approximation is usually very good for large values of N. There
is sometimes a gap, however, between the largest N for which exact
probabilities have been tabled and the smallest N at which the asymptotic
approximation is good.
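The combinatorial construction of such tables can be illustrated for a rank-sum statistic. The sketch below enumerates every way of allotting n1 of the N = n1 + n2 ranks to the first sample, counts the ways yielding each value of the rank sum, and divides by the total number of ways. The function names are hypothetical, and the rank sum is used only as a convenient example of a statistic whose probabilities are obtained by enumeration.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb

def rank_sum_distribution(n1, n2):
    """Exact null distribution of the sum of the ranks held by the first sample."""
    total = n1 + n2
    # Numerators: number of combinations of ranks producing each rank sum.
    counts = Counter(sum(ranks) for ranks in combinations(range(1, total + 1), n1))
    # Denominator: number of equally likely ways of choosing the n1 ranks.
    denominator = comb(total, n1)
    return {w: Fraction(count, denominator) for w, count in sorted(counts.items())}

def lower_tail(distribution, w):
    """Cumulated probability of a rank sum of w or less."""
    return sum(prob for value, prob in distribution.items() if value <= w)

dist = rank_sum_distribution(3, 4)
print(dist[6], lower_tail(dist, 7))    # 1/35 2/35
```

The number of combinations to be enumerated grows very rapidly with N, which is the reason, noted above, that exact tables seldom extend beyond samples of moderate size.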

The  existence  of  adequate  tables  is  an  important  criterion  for 
the  acceptability  of  a  distribution-free  test.  There  is  practically  no 
limit  to  the  number  of  distribution-free  tests  which  can  be  devised  on 
a  sound  mathematical  basis.  However,  a  test  for  which  no  tables 
have  been  computed  is  of  very  limited  value  unless  exact  cumulated 
probabilities can be easily computed by formula, or unless the asymptotic
approximation is good at small sample sizes, neither of which
is  likely  to  be  the  case. 

i. Sources. The survey of literature upon which this report
is based was confined almost entirely to publications written in English.
However, not all of the relevant English publications were reviewed and
only a fraction of those reviewed are reported. The number of relevant
articles is immense and increases exponentially as one broadens one's
definition of what is nonparametric. An attempt was made only to cover
tests, of broad applicability, whose probabilities can be calculated
exactly when samples are small, and which, when sampling from a
continuously distributed population, do not specify the exact form of that
distribution. This criterion, for example, eliminated tests of card
matching, which apparently find application only in experiments on
extra-sensory perception, approximate tests or parametric tests used
in violation of their assumptions, and tests requiring such nonclassical
but specific distributions as a Poisson or an exponential. Despite efforts
at thoroughness, however, it is virtually certain that relevant tests meeting
all these criteria have escaped the writer's attention; in some cases such
tests were detected, but were unobtainable. No claim is made for
complete coverage; however, it is felt that a core of better known
and more important tests has been covered fairly adequately.

In  the  following  chapters  tests  have  been  grouped  together 
largely  on  the  basis  of  a  common  type  of  mathematical  derivation, 
sometimes  according  to  the  type  of  sample  information  used,  and 
occasionally  according  to  the  type  of  function  which  the  test  serves. 
Only  the  simplest,  most  extensively  tabled,  and  most  promising  tests 
have  been  treated  at  length.  Sources  are  referenced  in  the  treatment 
of  each  test  and  are  listed  at  the  end  of  each  chapter.  (Occasionally 
reference  will  be  made  to  a  source  listed  in  the  bibliography  of  a 
different  chapter,  in  which  case  the  Arabic  reference  number  will 
be  preceded  by  a  Roman  numeral  indicating  the  number  of  the  chapter 
in  which  the  referenced  source  is  listed.  )  Because  the  number  of 
sources  relevant  to  a  given  test  or  to  a  general  topic  may  be  quite 
large,  those  sources  regarded  as  most  critical  have  been  indicated 
by printing their authors' names in capital letters. Primary sources
(or,  in  some  cases,  the  nearest  thing  to  a  primary  source)  for  a 
unique  distribution-free  test  have  been  indicated  by  an  asterisk. 
Sources  containing  tables  of  probabilities  for  a  distribution-free 
test have been indicated by placing a capital T in the left margin.
If the source contains tables for more than one such test, two T's
are  used;  and,  if  a  table  is  an  extensive  one,  the  T  is  underlined. 




BIBLIOGRAPHY 


    1.  Bahadur, R. R. and Savage, L. J., The nonexistence of
        certain statistical procedures in nonparametric problems.
        Annals of Mathematical Statistics, 1956, 27, 1115-1122.

    2.  BLUM, J. R. and FATTU, N. A., Nonparametric methods.
        Review of Educational Research, 1954, 24, 467-487.

    3.  Bradley, R. A., Some notes on the theory and application of
        rank order statistics. Parts I and II. Industrial Quality
        Control, 1955, 11.

TT  4.  Burington, R. S. and May, P. C., Handbook of probability
        and statistics with tables, Handbook Publishers Inc.,
        Sandusky, Ohio, 1953.

    5.  Chernoff, H., A measure of asymptotic efficiency for tests of
        a hypothesis based on the sum of observations. Annals of
        Mathematical Statistics, 1952, 23, 493-507.

    6.  Craig, C. C., Recent advances in mathematical statistics, II.
        Annals of Mathematical Statistics, 1942, 13, 74-85.

    7.  VAN DANTZIG, D. and HEMELRIJK, J., Statistical methods
        based on few assumptions (also errata and addenda).
        Bulletin of the International Statistical Institute, 1954, 34.

TT  8.  Dixon, W. J. and Massey, F. J., Introduction to statistical
        analysis, New York: McGraw-Hill, 1951, pp. 247-263.

    9.  Dwass, M., On the asymptotic normality of certain rank order
        statistics. Annals of Mathematical Statistics, 1953, 24,
        303-306.

   10.  Dwass, M., On the asymptotic normality of some statistics
        used in non-parametric tests. Annals of Mathematical
        Statistics, 1955, 26, 334-339.

TT 11.  Edwards, A. L., Statistical methods for the behavioral
        sciences, New York: Rinehart, 1954, pp. 181-212,
        399-439.



   12.  Fisher, R. A., Contributions to mathematical statistics,
        New York: Wiley, 1950.

TT 13.  Fisher, R. A. and Yates, F., Statistical tables for biological,
        agricultural and medical research, 3rd Ed.,
        New York: Hafner, 1949.

   14.  Fraser, D. A. S., Nonparametric methods in statistics,
        New York: Wiley, 1957.

T  15.  Hald, A., Statistical theory with engineering applications,
        New York: Wiley, 1952.

   16.  Hoeffding, W., A class of statistics with asymptotically
        normal distribution. Annals of Mathematical Statistics,
        1948, 19, 293-325.

   17.  Hoeffding, W., "Optimum" nonparametric tests. Proceedings
        of the Second Berkeley Symposium on Mathematical
        Statistics and Probability. University of California Press,
        1951, pp. 83-92.

   18.  Hoeffding, W., Some powerful rank order tests (abstract).
        Annals of Mathematical Statistics, 1952, 23, 303.

   19.  Hoeffding, W., The large-sample power of tests based on
        permutations of observations. Annals of Mathematical
        Statistics, 1952, 23, 169-192.

   20.  Hoeffding, W. and Rosenblatt, J., The efficiency of tests.
        Annals of Mathematical Statistics, 1955, 26, 52-63.

T  21.  Hoel, P. G., Introduction to mathematical statistics,
        2nd Ed., New York: Wiley, 1954, pp. 281-303.

   22.  Hotelling, H. and Pabst, Margaret R., Rank correlation
        and tests of significance involving no assumption of
        normality. Annals of Mathematical Statistics, 1936, 7, 29-43.

TT 23.  Isaac, W., Some nonparametric tests and their tables,
        (mimeographed) 45 pp.

   24.  James, G. and James, R. C., Mathematics dictionary,
        New York: van Nostrand, 1949.



TT 25. Kendall, M. G., Rank correlation methods, 2nd Ed., New York:
Hafner, 1955.

26. Kendall, M. G., The advanced theory of statistics, Vol. II,
London: Griffin, 1946.

27. Kendall, M. G. and Buckland, W. R., A dictionary of statistical
terms. New York: Hafner, 1957.

28. Kendall, M. G. and Sundrum, R. M., Distribution-free methods and
order properties. Review of the International Statistical Institute,
1953, 3, 124-134.

*T 29. Kruskal, W. H. and Wallis, W. A., Use of ranks on one-criterion
variance analysis. Journal of the American Statistical Association,
1952, 47, 583-621.

* 30. Lehmann, E. L., Consistency and unbiasedness of certain
nonparametric tests. Annals of Mathematical Statistics, 1951, 22,
165-179.

* 31. Lehmann, E. L., The power of rank tests. Annals of Mathematical
Statistics, 1953, 24, 23-43.

32. Lehmann, E. L. and Stein, C., On the theory of some nonparametric
hypotheses. Annals of Mathematical Statistics, 1949, 20, 28-45.

* 33. Mood, A. M., Introduction to the theory of statistics. New York:
McGraw-Hill, 1950, pp. 385-418.

34. Moses, L. E., Nonparametric methods. Chapter 8 in Walker, Helen
and Lev, J., Statistical inference. New York: Holt, 1953, pp. 426-450.

35. Moses, L. E., Non-parametric statistics for psychological
research. Psychological Bulletin, 1952, 49, 122-143.

36. Noether, G. E., On a theorem by Pitman. Annals of Mathematical
Statistics, 1955, 26, 64-68.

37. Ruist, E., Comparison of tests for non-parametric hypotheses.
Arkiv för Matematik, 1955, 3, 133-163.




38. Savage, I. R., Bibliography of nonparametric statistics and
related topics. Journal of the American Statistical Association,
1953, 48, 844-906.

39. Savage, I. R., Contributions to the theory of rank order
statistics: two sample case, I. Annals of Mathematical Statistics,
1956, 27, 590-615.

40. Savage, I. R., Nonparametric statistics. Journal of the American
Statistical Association, 1957, 52, 331-344.

41. Scheffé, H., Statistical inference in the non-parametric case.
Annals of Mathematical Statistics, 1943, 14, 305-332.

42. Siegel, S., Nonparametric statistics. American Statistician,
1957, 11, 13-19.

TT 43. Siegel, S., Nonparametric statistics for the behavioral
sciences. New York: McGraw-Hill, 1956.

44. Smith, K., Distribution-free statistical methods and the concept
of power efficiency. In Festinger, L. and Katz, D. (Eds.), Research
methods in the behavioral sciences. New York: Dryden Press, 1953,
pp. 536-577.

45. Stevens, S. S., On the theory of scales of measurement. Science,
1946, 103, 677-680.

46. Stuart, A., The correlation between variate-values and ranks in
samples from a continuous distribution. British Journal of
Statistical Psychology, 1954, 7, 37-44.

47. Stuart, A., The correlation between variate-values and ranks in
samples from distributions having no variance. British Journal of
Statistical Psychology, 1955, 8, 25-27.

48. Stuart, A., The cumulants of the first n natural numbers.
Biometrika, 1950, 37, 446.

49. Terry, M. E., Some rank order tests which are most powerful
against specific parametric alternatives. Annals of Mathematical
Statistics, 1952, 23, 346-366.

50. van der Waerden, B. L., Order tests for the two-sample problem
and their power. Proceedings Koninklijke Nederlandse Akademie van
Wetenschappen, Series A, 1952, 55, 453-458.




51. van der Waerden, B. L., Order tests for the two-sample problem
and their power. (Corrigenda) Proceedings Koninklijke Nederlandse
Akademie van Wetenschappen, Series A, 1953, 56, 80.

52. van der Waerden, B. L., Order tests for the two-sample problem
II and III. Proceedings Koninklijke Nederlandse Akademie van
Wetenschappen, Series A, 1953, 56, 303-310, 311-316.

53. Wald, A. and Wolfowitz, J., Statistical tests based on
permutations of the observations. Annals of Mathematical Statistics,
1944, 15, 358-372.

54. Wallis, W. A., Rough-and-ready statistical tests. Industrial
Quality Control, 1952, 8, 35-40.

55. Whitworth, W. A., Choice and chance. New York: Hafner, 1951.

*TT 56. Wilcoxon, F., Some rapid approximate statistical procedures.
American Cyanamid Co., Pamphlet, 1949.

57. Wilks, S. S., Mathematical statistics, Princeton, N. J.:
Princeton University Press, 1950.

58. Wilks, S. S., Order statistics. Bulletin of the American
Mathematical Society, 1948, 54, 6-50.

TT 59. Wilson, E. B., An introduction to scientific research. New
York: McGraw-Hill, 1952, pp. 185-189, 197-202, 229-231, 247-250,
266-268.

60. Wolfowitz, J., Additive partition functions and a class of
statistical hypotheses. Annals of Mathematical Statistics, 1942, 13,
247-279.

61. Wolfowitz, J., Non-parametric statistical inference. Proceedings
Berkeley Symposium on Mathematical Statistics and Probability,
Berkeley: University of California Press, 1949, pp. 93-113.

62. Yule, G. U. and Kendall, M. G., An introduction to the theory of
statistics. New York: Hafner, 1950, pp. 19-68, 187-189, 258-272,
459-481.




CHAPTER II


TESTS  BASED  ON  THE  BINOMIAL  DISTRIBUTION 

A number of distribution-free test statistics are binomially
distributed. They are among the simplest, safest, most nearly exact
and most extensively tabled nonparametric tests. Their statistical
efficiency is not the highest, but is generally not so low as to
nullify their other advantages. The sample information used by most
of them is simply the direction of the difference between two scores,
i.e., the algebraic sign of the difference. Binomial tests are
extremely versatile, finding application in testing for location,
trend (in either location or dispersion), randomness of predicted
order, and in the setting of confidence limits for quantiles.

1. Introduction


Suppose that all of the possible outcomes of an event may be
dichotomized into two mutually exclusive categories, arbitrarily
labeled "success" and "failure", these two outcomes having
probabilities p and q = 1 - p respectively. Then if the event is
permitted to occur n times, the probability that r of the n outcomes
will be successes is $P_r(r) = \binom{n}{r} p^r q^{n-r}$, which is the
general expression for a term in the expansion of the binomial
$(p+q)^n$.

Proof: The probability that r successes and n - r failures will occur
in a specified order is $p^r q^{n-r}$. For example, letting subscripts
indicate order of appearance, the probability for the order in which
all successes occur first, followed by all failures, is the product
$(p_1)(p_2) \cdots (p_r)(q_{r+1}) \cdots (q_n) = p^r q^{n-r}$. However,
since we seek only the probability of a given frequency of successes,
the probability $p^r q^{n-r}$ of a given frequency of successes
occurring in a specified pattern must be multiplied by the number of
patterns which r successes and (n-r) failures can assume. If the n
units (p's and q's) were all distinguishable, the number of unique
patterns would be n!, the number of permutations of n things. They are
not all distinguishable, however. In each distinguishable pattern, the
r successes can be permuted with one another in r! ways without
changing the pattern. And for each such permutation of successes, the
n-r failures can be permuted in (n-r)! ways without changing the
appearance of the pattern. The number of permutations, n!, then must
be the number of distinguishable patterns times r! (n-r)!, the number
of ways each distinguishable pattern can be permuted without altering
its appearance. The number of distinguishable patterns is therefore
$\frac{n!}{r!\,(n-r)!}$, which is, of course, the number of
combinations of n things taken r at a time, frequently expressed by
the symbol $\binom{n}{r}$. The probability of exactly r successes in n
trials is therefore $\binom{n}{r} p^r q^{n-r}$ and the cumulative
probability, i.e., the probability of r or fewer successes in n
trials, is $\sum_{i=0}^{r} \binom{n}{i} p^i q^{n-i}$.
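The point and cumulative expressions just derived can be evaluated
directly. The following is a minimal sketch in Python, not part of the
original derivation; the function names binom_pmf and binom_cdf are
illustrative assumptions.

```python
from math import comb


def binom_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    q = 1.0 - p
    # comb(n, r) is the number of distinguishable patterns, n! / (r! (n-r)!)
    return comb(n, r) * p**r * q**(n - r)


def binom_cdf(r, n, p):
    """Probability of r or fewer successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(r + 1))


# Four trials with p = 1/2: exactly two successes, and two or fewer.
print(binom_pmf(2, 4, 0.5))  # 6/16 = 0.375
print(binom_cdf(2, 4, 0.5))  # (1 + 4 + 6)/16 = 0.6875
```

Summing binom_pmf over r = 0, ..., n returns 1, as it must for the
expansion of (p+q)^n with p + q = 1.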

The binomial term $\binom{n}{r} p^r q^{n-r}$ expresses the probability
for r successes out of n trials only if the following conditions,
implicit in its derivation, are met:

(a) Outcomes must be capable of being dichotomized (since only two
outcome probabilities, p and q, are used in the derivation).

(b) The two outcome categories must be mutually exclusive (since
q = 1 - p).

(c) The outcomes of the n events must be completely independent.
(Since the same value, p, is used to express the probability of
success on each of the n trials, the probability of success on a
single trial must not change from one trial to another and,
therefore, must not be influenced by the outcome of any other trial.)

(d) "Events" must be randomly selected. (The formula
$\binom{n}{r} p^r q^{n-r}$ gives the probability that by chance r
successes will occur in n trials if the chance probability of success
in a single trial is p. If events are not randomly selected, then
outcomes are susceptible to nonchance influences.) There must
therefore be no bias or system in the selection of which n trials, out
of an infinite population of potential trials, to test. Specifically,
among other things, this means that none of the valid data may be
systematically excluded from the test.

The above qualifications will appear in modified form as assumptions
for all tests whose test statistic is binomially distributed. Such
tests are outstanding among distribution-free tests for two reasons:
First, they are extremely simple, both in derivation and in
application. Second, exact probabilities for both the point (20, 28)
and cumulative (34, 25, 28) binomial have been extensively tabled.
Thus, while for most distribution-free tests large n's require
probabilities to be calculated approximately from asymptotic formulae,
in the case of binomial tests exact probabilities are readily
attainable for many large samples.

The mean and variance of a binomially distributed variate are np and
npq respectively (for proof see Hoel [I-21] pp. 65-67), and when n is
large and p is close to .50 the binomial is closely approximated by
the normal distribution. The critical ratio for r, the number of
successes, is therefore $\frac{|r - np| - 1/2}{\sqrt{npq}}$, the 1/2
being a correction for continuity. The normal approximation should not
be used except for those cases not covered by the extensive binomial
tables which are now available. The approximation is reasonably good
so long as the product np is greater than 5. Even when this criterion
is met, however, the approximation is likely to be poor at the extreme
tails of the distribution, especially when n is small (say less than
100). The inaccuracy of the normal approximation can be expected to
increase, therefore, with decreasing n, with increasing departures of
p from .50 in either direction, and with decreasing, i.e. more and
more extreme, significance levels.
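The behavior of the normal approximation can be checked numerically.
The sketch below is an illustration only: it compares the exact lower
tail with the continuity-corrected normal value for one case where np
is well above 5 and one where it is not. The function names and the
two test cases are assumptions chosen for the example.

```python
from math import comb, erf, sqrt


def exact_lower_tail(r, n, p):
    """Exact binomial probability of r or fewer successes."""
    q = 1.0 - p
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(r + 1))


def normal_lower_tail(r, n, p):
    """Normal approximation to the same tail, with the 1/2 continuity correction."""
    z = (r + 0.5 - n * p) / sqrt(n * p * (1.0 - p))
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal c.d.f. at z


# (n, p, r): the first case has np = 50, the second np = 2.
for n, p, r in [(100, 0.5, 45), (20, 0.1, 0)]:
    exact = exact_lower_tail(r, n, p)
    approx = normal_lower_tail(r, n, p)
    print(n, p, r, round(exact, 4), round(approx, 4))
```

For r below np the quantity z is the negative of the critical ratio
given above, so the two formulations agree.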


2. The Sign Test for the Median Difference

a. Rationale. Suppose that n pairs of measurements have been taken,
one member of each pair having been taken under condition A, the
other under condition B, and that a B measurement is as likely to
exceed as to be exceeded by its paired A measurement. Then, if zero
differences are impossible, the differences $A_i - B_i$ can be either
positive or negative and the outcome "positive" is binomially
distributed with probability p = 1/2. For example, John Arbuthnott (1)
found that every year from 1629 to 1710 the number of males born in
the city of London exceeded the number of females. If male and female
babies are equally likely, the chance probability of the reported
results is $\binom{82}{0} (1/2)^{82} = (1/2)^{82}$. (Arbuthnott
obtained this result and interpreted the excess of male births as a
manifestation of Divine Providence, which he believed to be allowing
precisely for the greater mortality rate among males "who must seek
their Food with danger", so as to leave a perfect equality of sexes at
the age of mating.)
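As a check on the arithmetic of Arbuthnott's example, the years 1629
through 1710 inclusive number 82, and the chance probability that all
82 fall on the same prespecified side is (1/2) raised to that power. A
short illustrative computation (the variable names are assumptions):

```python
from fractions import Fraction

years = 1710 - 1629 + 1          # both end years are counted
prob = Fraction(1, 2) ** years   # every year shows an excess of males

print(years)        # 82
print(float(prob))  # roughly 2.07e-25
```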

b. Null Hypothesis. For every $A_i - B_i$ difference,
$P_r(A_i > B_i) = P_r(A_i < B_i) = 1/2$. Sufficient conditions for its
validity are that both the A population and the B population are
continuously distributed and the population of A - B differences has
a median of zero.

c. Assumptions. Since binomial tests require that outcomes must be of
two types only, there must be no zero differences, i.e., the members
of no pair shall be "tied." Frequently this requirement is expressed
by the more restrictive assumption that the population of differences
is continuously distributed. Since the outcomes of binomial events
must be independent, the sign of the difference for one pair must have
no influence upon the sign of the difference for any other pair. This
means, among other things, that a given A measurement shall be paired
once and only once with a measurement from the B population. Finally,
the sample of measurements must have been randomly selected from the
parent population of differences.

d. Treatment of Ties. The null hypothesis is that
$P_r(A_i > B_i) = P_r(A_i < B_i) = 1/2$. Therefore $P_r(A_i = B_i)$
must equal zero. Zero differences constitute a third category of
outcomes. Since the Sign test is based upon the binomial distribution,
which requires that outcomes fall into two mutually exclusive classes,
zero differences are decidedly embarrassing. They can occur for two
reasons: because a noninfinitesimal proportion of the parent
population of differences is zero, or because, although this is not
the case, zero differences are obtained due to the inability of the
measuring instrument to achieve infinite precision. In the former
case, the Sign test simply is not appropriate. For the latter case,
various methods have been recommended for disposing of zero
differences. They can be dropped and n reduced accordingly (14, I-8,
27). Half may be treated as plusses, half as minuses (8, 27). They may
be replaced by signs "drawn" randomly from an infinite population half
of whose members are plusses, half of which are minuses (27). Or all
zeros may be treated as if they had the algebraic sign least conducive
to rejection of the null hypothesis.

The Sign test has greatest power when zero differences are dealt with
according to the first alternative. However, the greater power
resulting from use of this method is not necessarily an argument for
its adoption. A zero, being in a sense "halfway between" a plus and a
minus, suggests that plusses and minuses are equally likely. By
ignoring, i.e. discarding, data which lend support to the null
hypothesis, one naturally increases the probability of rejecting that
hypothesis and consequently enhances the power of the test. The
probability of rejecting a true null hypothesis has also increased,
however, and the apparent gain in power is attributable to a subtle
increase in the "true", as contrasted with the nominal, significance
level. For example, consider 1000 differences of which 960 are zero,
13 plus and 27 minus. If half of the zeros are regarded as plus and
half as minus and the two-tailed Sign test is applied to the 493
plusses and 507 minuses, the cumulative probability is .681. If the
zero differences are discarded and the test is applied to the 13
plusses and 27 minuses, the cumulative probability falls within the
.05 level of significance. Assuming that half the zeros actually
represent plus scores, half minus scores, the "true" cumulative
probability is .681 in both cases. However, in the latter case the
experimenter believes his significance level to be .05 when actually
the true significance level corresponding to this alleged figure would
be some figure greater than .681. Thus discarding the zeros biases the
test toward rejection.
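The two probabilities in this example can be reproduced from the exact
binomial. The following sketch is illustrative only; the helper name
two_tailed_sign_p is an assumption, and the two-tailed value is taken
as twice the smaller tail.

```python
from math import comb


def two_tailed_sign_p(plus, minus):
    """Two-tailed Sign test probability for the given counts, p = 1/2."""
    n = plus + minus
    k = min(plus, minus)
    # Exact lower tail: k or fewer of the rarer sign among n differences.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Half of the 960 zeros counted as plus, half as minus.
print(round(two_tailed_sign_p(493, 507), 3))
# The 960 zeros discarded.
print(round(two_tailed_sign_p(13, 27), 3))
```

The first value agrees with the .681 quoted above to about three
decimals, and the second falls below .05.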

The "randomization" method preserves exactly the mathematical
conditions upon which the validity of the Sign test depends. However,
it makes little sense experimentally. Normally one interprets small
chance probabilities as implying the presence of a nonchance effect.
But if it is known that pure chance determined a substantial portion
of one's results, then small chance probabilities may imply unlikely
chance effects as strongly as (or more strongly than) nonchance
effects. In such cases the null hypothesis may remain as reasonable as
any alternative hypothesis. Ambiguities may also arise in marginal
situations. Suppose for example that an experimenter using the .05
level of significance obtains significant results after "randomizing"
zeros, but discovers that his results would have a "chance"
probability of .15 had he regarded half the zeros as plusses, half as
minuses. The reverse situation would be equally distressing.

The first three methods of dealing with zero differences are based
upon an implicit assumption that zero differences represent true
differences which, if measured with infinite accuracy, would be found
to be positive half of the time, negative half of the time. However,
if zero differences are due to imprecision of measurement, as it is
assumed, such a 50-50 split is by no means assured. The "measuring
instrument" might be such that all differences between -.0015 and
+.0040 were measured as zero. One would then expect the preponderance
of recorded zeros to represent true plusses.

None of the methods of dealing with zero differences, therefore, is
entirely satisfactory. Giving all zeros the sign least conducive to
rejection is the safest method while, in the long run, the average
probability error is minimized by treating half the zeros as plusses,
half as minuses. If only a small proportion of the differences are
zero, say less than 5%, one would expect the "error" introduced by
zero differences generally to be of small practical consequence.
However, when zeros constitute a substantial proportion of the data,
considerable caution should be used in applying the Sign test.
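The four ways of disposing of zero differences can be compared side by
side. The sketch below is a hypothetical illustration, not a procedure
taken from the sources cited: it assumes a one-tailed test whose
alternative is "too few plusses", so that the sign least conducive to
rejection is unambiguously plus. The function names, the method labels
and the counts (3 plusses, 12 minuses, 4 zeros) are all invented for
the example.

```python
import random
from math import comb


def lower_tail(r, n):
    """Probability of r or fewer plusses among n signs when p = 1/2."""
    return sum(comb(n, i) for i in range(r + 1)) / 2**n


def assign_zeros(plus, minus, zeros, method, rng=None):
    """Return (plus, minus) after disposing of the zero differences."""
    if method == "drop":
        return plus, minus
    if method == "split":
        half = zeros // 2  # an odd zero left over is discarded
        return plus + half, minus + half
    if method == "randomize":
        rng = rng or random.Random()
        drawn_plus = sum(rng.random() < 0.5 for _ in range(zeros))
        return plus + drawn_plus, minus + zeros - drawn_plus
    if method == "conservative":
        # Assumed alternative: too few plusses, so every zero counts as a plus.
        return plus + zeros, minus
    raise ValueError(f"unknown method: {method}")


for method in ("drop", "split", "conservative"):
    p_count, m_count = assign_zeros(3, 12, 4, method)
    print(method, p_count, m_count, round(lower_tail(p_count, p_count + m_count), 4))
```

With these made-up counts the one-tailed probability rises from the
"drop" method through the "split" method to the "conservative" method,
in line with the ordering of power and safety described above.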

e. Efficiency. A normal distribution is symmetrical with median equal
to mean. Therefore, if applied to a normally distributed population of
differences, the Sign test for the median difference is equally a test
for the mean difference and can legitimately be compared with
Student's t-test. Under the conditions stated, the one-tailed Sign
test has, relative to Student's t, an asymptotic efficiency of
$2/\pi$ or .637. This same figure is obtained whether the asymptotic
efficiency be an estimate efficiency (4, 44), A.R.E. (15), or an
efficiency of certain other types (7, 15, 16). It refers, of course,
to the case where the discrepancy $\delta$ between the true difference
and hypothesized difference is zero, i.e., very slight. If samples are
of infinite size, the efficiency of the Sign test is independent of
the size $\alpha$ of the significance level, but decreases from .637
to a limiting value of .500 as $\delta$ increases from zero to
infinity (15).

The small sample efficiency of the Sign test depends strongly upon the
precise definition of efficiency chosen (2). It decreases with
increasing values of n, $\alpha$ and $\delta$ (7). Small sample
efficiencies as high as .96 have been found (43).

Power functions for the Sign test have been published by Dixon (7) and
by Walsh (42). Stewart (36) has prepared tables giving the sample size
at which a false null hypothesis (p = .50) will have a given
probability of rejection, i.e., the test will have a given power, at
the .05 level of significance, for various "true" values of p. The
test is consistent provided only that $p \neq q$, i.e., in the present
case provided only that the null hypothesis is false (14).

f. Application. Subtract each B score from its matched, i.e. paired,
A score. If a small proportion of the differences are zero, "assign"
half of them a positive sign, half a negative sign; if there are an
odd number of zero differences, discard one zero difference, reduce n
by one, and proceed as above. Let r be the number of plusses and n-r
be the number of minuses after the zeros have been "assigned." Then
the cumulative probability of obtaining r or fewer plusses by chance
if the null hypothesis is true is

$\sum_{i=0}^{r} \binom{n}{i} (1/2)^n$.

If a two-tailed test is required, one rejects the null hypothesis if
this cumulative probability equals or is less than $\alpha/2$ or
equals or exceeds $1 - \alpha/2$. If a one-tailed test is required and
the alternative hypothesis is that the median difference is less than
zero, the null hypothesis is rejected if
$\sum_{i=0}^{r} \binom{n}{i} (1/2)^n \le \alpha$. For the opposite
alternative, reject if the summation equals or exceeds $1 - \alpha$.
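The application steps can be sketched in Python as follows. This is an
illustrative reading of the procedure as worded above, not a reference
implementation; the function name sign_test, the labels given to the
alternatives, and the sample scores are assumptions.

```python
from math import comb


def sign_test(a, b, alpha=0.05, alternative="two-sided"):
    """Sign test for the median difference of paired scores a and b.

    Returns (r, n, cumulative probability of r or fewer plusses, reject).
    """
    diffs = [x - y for x, y in zip(a, b)]
    plus = sum(d > 0 for d in diffs)
    minus = sum(d < 0 for d in diffs)
    zeros = len(diffs) - plus - minus
    half = zeros // 2            # an odd zero is discarded, reducing n by one
    r = plus + half              # half of the zeros are assigned a plus sign
    n = plus + minus + 2 * half
    cum = sum(comb(n, i) for i in range(r + 1)) / 2**n
    if alternative == "two-sided":
        reject = cum <= alpha / 2 or cum >= 1 - alpha / 2
    elif alternative == "less":      # median difference less than zero
        reject = cum <= alpha
    elif alternative == "greater":   # median difference greater than zero
        reject = cum >= 1 - alpha
    else:
        raise ValueError(f"unknown alternative: {alternative}")
    return r, n, cum, reject


# Made-up paired scores: nine of the ten differences are negative.
a = [10, 12, 9, 11, 14, 8, 13, 10, 12, 11]
b = [12, 15, 11, 10, 16, 9, 15, 13, 14, 12]
print(sign_test(a, b))
```

With one plus among ten differences the cumulative probability is
11/1024, which is below .025, so the two-tailed test at the .05 level
rejects.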


g. Tables. Probabilities can be most accurately obtained from tables
of the cumulative binomial (34, 25, 28, 46) entered with p = .50.
Other tables (4, 8, 26, I-8, I-23, I-43, I-59) have been designed
specifically for the Sign test.


h. Discussion. Mathematically the Sign test simply tests the
hypothesis that the parameter, p, of a binomial population has the
value .50. In equivalent experimental terms it tests the null
hypothesis that the population of A-B differences has a median of
zero. The inference is frequently made that if the median difference
is zero, then the A population and the B population are equally "good"
in a quantitative sense. Such an inference cannot legitimately be made
without introducing an additional assumption: that the A-B differences
are symmetrically distributed about zero. Without this assumption one
can legitimately infer that half of the units comprising the A
population are superior to the units with which they happen to be
matched in the B population and that half of the B units are superior
to their paired mates from the A population, but not that these two
"superiorities" represent equivalent difference magnitudes. It is to
be noted that the assumption of symmetry requires that the mean
difference be zero.

By adding M to each B score before subtraction from its paired A
score, one can test the null hypothesis that the median difference is
M. If the assumption of symmetry can justifiably be made, one can test
the hypothesis that the mean difference is M, or, in other words, that
the A population is on the average M units "better" than the B
population. By multiplying each B score by $1 + p/100$ before
subtraction, one can, under the assumption of symmetry, test the
hypothesis that the A population is on the average p percent "better"
than the B population. (See 8 or 26)
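Both modifications amount to adjusting each B score before taking
signs. The sketch below is illustrative only: the helper names and
scores are invented, zero differences are assumed absent, and the
percentage version assumes the multiplier is 1 + p/100.

```python
from math import comb


def sign_counts(diffs):
    """Return (plusses, minuses); zero differences are assumed absent."""
    return sum(d > 0 for d in diffs), sum(d < 0 for d in diffs)


def lower_tail(r, n):
    """Cumulative probability of r or fewer plusses when p = 1/2."""
    return sum(comb(n, i) for i in range(r + 1)) / 2**n


def shifted_diffs(a, b, M):
    """Differences for the null hypothesis that the median of A - B is M."""
    return [x - (y + M) for x, y in zip(a, b)]


def percent_diffs(a, b, p):
    """Differences for the null hypothesis that A is p percent better than B."""
    return [x - y * (1 + p / 100) for x, y in zip(a, b)]


a = [15, 17, 14, 18, 16, 19, 15, 17]  # made-up scores
b = [10, 11, 10, 12, 11, 13, 10, 12]

plus, minus = sign_counts(shifted_diffs(a, b, 5.5))
print(plus, minus, lower_tail(plus, plus + minus))

plus, minus = sign_counts(percent_diffs(a, b, 45))
print(plus, minus, lower_tail(plus, plus + minus))
```

The resulting cumulative probability is then referred to the same
rejection limits as in the unmodified test.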

The preceding discussion has assumed that every A score has the same
parent population and likewise for every B score. Actually the formula
holds good even if every A or B score comes from a different
population so long as each population corresponding to a given A-B
difference has zero median. The null hypothesis tested is that all of
the populations from which the A-B differences were "drawn" have zero
median. This type of application should be approached with caution,
however. Suppose, for example, that half of the pairs represent
populations in which A's are truly superior to B's while the reverse
is true for the other half. Although the null hypothesis is entirely
false, the probability of its rejection is no greater than if it were
true. Again, suppose that for a tenth of the pairs A's are truly
superior to B's while for the remainder there is no real difference.
The power of the test would be much greater if that tenth of the data
were tested separately. Applications of the type described, therefore,
may greatly reduce the power of the test, and even when the null
hypothesis is rejected, it is not at all clear what alternative
hypothesis is indicated. Finally, in this type of application, the
modifications described in the preceding paragraph become meaningless
and should not be used.

It has been stated that the Sign test is particularly appropriate when
the members of each pair were subjected to similar treatment, but when
treatments differed from one pair to another. This, of course,
represents a special case of the application discussed above. Here it
is implied that a number of variables may have a real effect upon the
absolute values of the A's, the B's or even the A-B differences, but
that only one variable, the one in which the experimenter is
interested, can have a real effect upon the direction of the A-B
differences, i.e., the signs of the differences. This is not
necessarily an unrealistic assumption. For example, the A's and B's
might be positions of seismograph needles during, and an hour previous
to, an hypothesized tremor. The seismographs being located in widely
different parts of the world, the A-B differences would be expected to
vary in size with distance from the source of tremor. Furthermore, the
numerical size of the difference might be reported in metric units by
some and in British units of measurement by others. These
considerations would preclude the use of a t-test, but not the Sign
test, since the variables mentioned would affect the size but not the
direction of the differences.

It is extremely important, however, that no variable causing
differences between pairs shall interact with the variable in which
the experimenter is interested, i.e., shall differentially affect the
sign of the difference between members of a pair. Suppose, for
example, that A and B are two strains of wheat and that some of the AB
pairs were grown in a northeastern county, the rest in a southwestern
county. If the former location has a moist climate, the latter a dry
one, it may well be that A is superior to B in one location and
inferior in the other. Subjecting pairs to different treatments,
therefore, may introduce subtle and spurious interactions between
"tested" and "nontested" effects with the result that the power of the
test is reduced and the true alternative hypothesis may differ greatly
from the alleged one.

i. Sources. 1, 2, 4, 7, 8, 10, 12, 13, 14, 15, 16, 26, 27, 30, 36,
42, 43, 44, 45, I-2, I-3, I-8, I-11, I-21, I-23, I-28, I-35, I-43,
I-54, I-59.


3.  The  Sign  Test  for  the  Median 

a. Rationale. Suppose that n observations, $X_i$'s, are taken from a
continuously distributed population whose median is M. Then half of
the observations, on the average, should fall above M, half below,
i.e., the number of observations falling above M is binomially
distributed with p = .50. Thus, the number of observations above an
hypothesized median M can be used to test the validity of the
hypothesis. But the number of observations above M is the same as the
number of positive differences if M is subtracted from each
observation. The Sign test for the median, therefore, is equivalent to
the Sign test for the median difference in which the $X_i$'s
constitute the A population and the B population consists of the
single value M.

b. Null Hypothesis. For every $X_i$,
$P_r(X_i > M) = P_r(X_i < M) = 1/2$. Sufficient conditions for its
validity are that the X's are drawn independently and are continuously
distributed with a common population median M. It is in fact only
necessary to assume that the X's are continuously distributed in the
neighborhood of M.

c. Assumptions. (1) $P_r(X_i = M) = 0$, i.e., none of the observations
must fall on the hypothesized median.

(2) Whether a given $X_i$ falls above or below M is independent of the
position of any other $X_i$ with respect to M. This implies among
other things that either the population is an infinite one, which will
be the case if it is continuously distributed, or sampling is with
replacement.

(3) The $X_i$'s must have been randomly selected from their respective
populations.

d.  Treatment  of  Observations  Falling  on  the  Hypothesized 
Median.  See  2.  Treatment  is  analogous. 


e.  Efficiency.  See  2.  Efficiencies  quoted  under  2  apply  with 
equal  validity  to  the  test  for  the  median. 


f. Application. Count the number, r, of X's which are
smaller than M. If a small proportion of the X's equal M, count
half of them as smaller than M. If there are an odd number of such
tied X's, discard one of them and reduce n by 1. For a two-tailed
test at the level α, reject the null hypothesis if

$\sum_{i=0}^{r} \binom{n}{i} (1/2)^n \le \alpha/2$ or $\ge 1 - \alpha/2$.

If the alternative hypothesis for a one-tailed test is that the
population median exceeds M, reject the null hypothesis if
$\sum_{i=0}^{r} \binom{n}{i} (1/2)^n \le \alpha$. For the opposite
one-tailed alternative hypothesis, reject if the summation $\ge 1 - \alpha$.
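
The Application steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `binom_cdf_half` and `sign_test_median` are hypothetical, and the handling of ties simply follows the rule quoted above (half of the tied values counted as below M, an odd tie discarded).

```python
from math import comb


def binom_cdf_half(r, n):
    """Pr(at most r successes in n trials) when p = 1/2."""
    return sum(comb(n, i) for i in range(r + 1)) / 2 ** n


def sign_test_median(xs, m, alpha=0.05):
    """Two-tailed Sign test of the hypothesis that the median is m.

    Returns (r, n, cumulative probability, reject?).
    """
    below = sum(1 for x in xs if x < m)
    ties = sum(1 for x in xs if x == m)
    n = len(xs)
    r = below + ties // 2      # half of the ties are counted as below m
    if ties % 2:               # an odd tie is discarded and n reduced by 1
        n -= 1
    cum = binom_cdf_half(r, n)
    reject = cum <= alpha / 2 or cum >= 1 - alpha / 2
    return r, n, cum, reject


if __name__ == "__main__":
    sample = [1.2, 3.4, 5.1, 6.0, 7.3, 8.8, 9.5, 10.1, 11.7, 12.4]
    print(sign_test_median(sample, 2.0))
```

For a one-tailed test the same cumulative probability would be compared with α or 1 - α instead of α/2 or 1 - α/2.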


g. Tables. See 2 and the paragraph below.


h. Discussion. If the X's are arranged in order of increasing
magnitude with subscripts indicating rank in that order (1 = smallest,
n = largest); then if r observations are below M, M exceeds the value
X_r, i.e., M > X_r. Therefore, rejecting the null hypothesis because r
observations have fallen below the median is equivalent to rejecting
it because the median exceeds X_r. Walsh (43) has prepared tables
of probabilities for the Sign test for the median which call for this
approach.

If the X_i's all come from the same continuously distributed
population whose mean equals its median (which will be the case if
the population is symmetrically distributed), the Sign test for the
median is equivalent to a test for the mean. In other words, at
the cost of introducing two new assumptions, homogeneous populations
and symmetrical distribution, the Sign test for the median becomes
a Sign test for the mean. By adding (or subtracting) a constant C
to every X before applying the test, the hypothesis can be
tested that the population mean has "slipped" a distance C below (or
above) a value it is known to have had at some earlier period.

i. Sources. See 2.


4. Cox and Stuart's S_2 Sign Test for Trend in Location

a. Rationale. Suppose that 2n measurements have been recorded
or are available in an order of sequence and it is desired to test
whether the sequence may contain a monotonic, i.e., nonreversing, trend.
If there is no trend of any kind, i.e., if sequential position has no effect
upon measurement magnitudes, these magnitudes will be randomly
distributed in sequence. If measurements are divided into independent pairs
and if in each pair the measurement later in sequence is subtracted from
the earlier measurement, the sign of each difference will be as likely
to be plus as to be minus. If zero differences are impossible, the
number of differences of one sign will be binomially distributed. On the
other hand, if a unidirectional trend exists, differences of one sign will
tend to predominate.

b. Null Hypothesis. Let subscripts represent the position of
a given measurement in the sequence of 2n measurements. The null
hypothesis, then, is that for every X_i with i ≤ n,
Pr(X_i > X_{i+n}) = Pr(X_i < X_{i+n}) = 1/2.
Sufficient conditions for its validity are that the X's are continuously
distributed and are randomly related to sequence, i.e., contain no
trend.

c. Assumptions. (1) Pr(X_i = X_{i+n}) = 0 for every i ≤ n,
i.e., the members of no pair are tied.

(2) Whether a given X_i falls above or below X_{i+n}
is independent of the outcome for any other pair.

(3) The X's are randomly selected.


d. Treatment of Ties. The authors recommend counting
half the zero differences as plusses, half as minuses. Also see 2.


e. Efficiency. Applied to populations known to be normally
distributed, the S_2 test for trend in location has asymptotic relative
efficiency .78 with respect to the best parametric test, based on the
regression coefficient (37). Under the same conditions, it has
A.R.E. .79 compared to Spearman's or Kendall's rank correlation
tests used as tests of randomness (5). For other comparisons see
Table I.


f. Application. If the total number of measurements is not
an even number, drop the middle measurement to make it so. Let
2n stand for the number of measurements remaining. From each X_i
in the first half of the sequence, subtract the corresponding
measurement X_{i+n} in the second half. If a small proportion of the
differences are zero, assign half of them a plus, half a minus. If an
odd zero remains, discard it and reduce n by 1. Let r be the number of
positive differences. Then for a two-tailed test at significance level α
reject the null hypothesis if $\sum_{i=0}^{r} \binom{n}{i} (1/2)^n$
either equals or is less than α/2 or equals or exceeds 1 - α/2. For a
one-tailed test at the level α reject the null hypothesis if
$\sum_{i=0}^{r} \binom{n}{i} (1/2)^n \le \alpha$ if the alternative
hypothesis is an upward trend (or $\ge 1 - \alpha$ if the alternative
hypothesis is a downward trend).
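
As a rough illustration of the pairing described under Application, the following Python sketch pairs each measurement in the first half with its partner in the second half and evaluates the cumulative binomial sum. The function name `cox_stuart_s2` and its return format are assumptions made for this example only.

```python
from math import comb


def cox_stuart_s2(seq, alpha=0.05):
    """Two-tailed S_2 sign test for trend in location (illustrative sketch).

    Returns (r, n, cumulative probability, reject?).
    """
    xs = list(seq)
    if len(xs) % 2:                  # odd count: drop the middle measurement
        del xs[len(xs) // 2]
    n = len(xs) // 2
    # Subtract each second-half measurement from its first-half partner.
    diffs = [xs[i] - xs[i + n] for i in range(n)]
    zeros = sum(1 for d in diffs if d == 0)
    r = sum(1 for d in diffs if d > 0) + zeros // 2   # half the zeros count as plus
    if zeros % 2:                    # odd zero discarded, n reduced by 1
        n -= 1
    cum = sum(comb(n, i) for i in range(r + 1)) / 2 ** n
    reject = cum <= alpha / 2 or cum >= 1 - alpha / 2
    return r, n, cum, reject


if __name__ == "__main__":
    # A steadily rising sequence of 13 measurements.
    print(cox_stuart_s2(range(1, 14)))
```

With a rising sequence every difference is negative, so r is small and the cumulative sum falls in the lower tail, which is the direction the text associates with an upward trend.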

g. Tables. See 2.

h. Sources. 5, 11.


5. Cox and Stuart's S_3 Sign Test for Trend in Location

a.  Rationale.  See  4,  substituting  "3n"  for  "2n". 

b. Null Hypothesis. Let subscripts represent the position of
a given measurement in the sequence of 3n measurements. The null
hypothesis, then, is that for every X_i with i ≤ n,
Pr(X_i > X_{i+2n}) = Pr(X_i < X_{i+2n}) = 1/2.
Sufficient conditions for its validity are that the X's are continuously
distributed and are randomly related to sequence, i.e., contain no trend.


c. Assumptions. See 4, substituting "X_{i+2n}" for "X_{i+n}".

d. Treatment of Ties. See 4.

e. Efficiency. Applied to populations known to be normally
distributed, the S_3 test for trend in location has A.R.E. .83 with
respect to the best parametric test, based on the regression
coefficient (37). Under the same conditions, it has A.R.E. .84 compared
to Spearman's or Kendall's rank correlation tests used as tests of
randomness (5). For other comparisons see Table I.


f. Application. If the total number of measurements is not
divisible by 3, "add" one or two "dummy" measurements in the middle
of the sequence to make it so. Let 3n stand for the number of
measurements as modified. From each X_i in the first third of the sequence,
subtract the corresponding measurement X_{i+2n} in the last third. The
data in the middle third will not be used. If a small proportion of the
differences are zero, assign half of them a plus, half a minus. If an
odd zero remains, discard it and reduce n by 1. Let r be the number
of positive differences. Then for a two-tailed test at significance
level α, reject the null hypothesis if
$\sum_{i=0}^{r} \binom{n}{i} (1/2)^n$ either equals or
is less than α/2 or equals or exceeds 1 - α/2. For a one-tailed test
at level α, reject the null hypothesis if
$\sum_{i=0}^{r} \binom{n}{i} (1/2)^n \le \alpha$ if the alternative
hypothesis is an upward trend (or $\ge 1 - \alpha$ if the alternative
hypothesis is a downward trend).

g. Tables. See 2.


h. Discussion. The S_3 test uses only 2/3 of the raw data
employed by the S_2 test; however, the members of each pair of
measurements whose difference is taken are 1/3 farther apart. The net
result is an increase in efficiency. If a real trend exists, then the
farther removed two measurements are in sequence, the greater the
expected difference in magnitudes and the more likely that the sign of
the difference will betray the direction of the trend. The S_2 test,
however, has one advantage. Since it uses all of the data, statistical
inference can be extended to the entire parent population. Strictly
speaking, inferences based on the S_3 test cannot legitimately be
extended to the middle third of the sampled sequence, since a temporary
trend occupying only this portion could not be detected.
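
The only mechanical difference between S_3 and S_2 is which measurements are paired. The Python sketch below, in which the name `cox_stuart_s3` is hypothetical, pairs the first third of the sequence with the last third. Padding the middle with "dummy" measurements is assumed here to be equivalent to taking n as N/3 rounded up, since the middle third is never used.

```python
from math import comb


def cox_stuart_s3(seq, alpha=0.05):
    """Two-tailed S_3 sign test for trend in location (illustrative sketch).

    Returns (r, n, cumulative probability, reject?).
    """
    xs = list(seq)
    if len(xs) < 2:
        raise ValueError("need at least two measurements")
    n = -(-len(xs) // 3)             # ceil(N / 3): size of a third after padding
    first, last = xs[:n], xs[-n:]    # the middle third is ignored
    diffs = [a - b for a, b in zip(first, last)]
    zeros = sum(1 for d in diffs if d == 0)
    r = sum(1 for d in diffs if d > 0) + zeros // 2   # half the zeros count as plus
    if zeros % 2:                    # odd zero discarded, n reduced by 1
        n -= 1
    cum = sum(comb(n, i) for i in range(r + 1)) / 2 ** n
    reject = cum <= alpha / 2 or cum >= 1 - alpha / 2
    return r, n, cum, reject


if __name__ == "__main__":
    # A steadily falling sequence of 10 measurements.
    print(cox_stuart_s3([10, 9, 8, 7, 6, 5, 4, 3, 2, 1]))
```

With only four pairs the two-tailed test has very little room, which is why the sketch should be read as a demonstration of the indexing rather than of realistic sample sizes.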

i. Sources. 5, 11, 37.

6. Cox and Stuart's S_3 Sign Test for Trend in Dispersion

a. Rationale. Suppose that 3kn measurements have been
recorded in order of sequence and it is desired to test whether the
dispersion of the measurements about a linear regression line changes
monotonically with position of measurements in the sequence. If the true
dispersion remains constant, then the ranges of consecutive sets of k
measurements should vary on a chance basis only. And if the range
of a subsequent set is subtracted from that of an earlier set, the
difference is as likely to be positive as to be negative. If zero differences
are impossible, the number of differences of one sign will be binomially
distributed. On the other hand, if dispersion changes monotonically
with position in sequence, differences of one sign will tend to
predominate.


b. Null Hypothesis. Let w_i represent the range of the i-th
consecutive set of k measurements. The null hypothesis, then, is
that for every w_i with i ≤ n,
Pr(w_i > w_{i+2n}) = Pr(w_i < w_{i+2n}) = 1/2.
Sufficient conditions for its validity are that the X's are continuously
distributed with constant dispersion about a linear regression line.

c. Assumptions. (1) Pr(w_i = w_{i+2n}) = 0 for every i ≤ n,
i.e., the members of no pair are tied. If the X's are continuously
distributed, the w's will be also and the assumption will be satisfied.

(2) Whether a given w_i falls above or below
w_{i+2n} is unaffected by the outcome for any other such pair.

(3) The X's are randomly selected.

d. Treatment of Ties. See 4.

e. Efficiency. Applied to populations known to be normally
distributed, the S_3 test for dispersion has A.R.E. of .71 compared to
the maximum likelihood test (5).

f. Application. The selection of the integer k is arbitrary and
will not affect the validity of the test; however, it can be expected to
affect the test's power. Letting N stand for the total number of
measurements, the following rule is suggested by the authors:
take k = 2 if N < 48, take k = 3 if 48 ≤ N < 64, take k = 4 if 64 ≤ N < 90,
take k = 5 if N ≥ 90. Let n be the integral part of N/3k and drop N - 3kn
measurements from the middle of the sequence. Divide the 3kn remaining
measurements into 3n consecutive sets of k measurements each.
Find the range of measurements within each of the 3n sets. Finally,
using these ranges as scores or measurements, proceed exactly as in
the S_3 test for trend in location.
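
The bookkeeping in this procedure (choosing k, trimming the middle, forming ranges, then pairing the first third of the ranges with the last third) is easy to get wrong by hand. The Python sketch below is one possible reading of the steps above. The names `choose_k`, `set_ranges`, and `s3_dispersion` are hypothetical, the boundary values of N are assigned to the larger k as an assumption, and tied ranges are handled as in the location test.

```python
from math import comb


def choose_k(total):
    """Set size k suggested in the text for `total` measurements."""
    if total < 48:
        return 2
    if total < 64:
        return 3
    if total < 90:
        return 4
    return 5


def set_ranges(seq):
    """Return (k, n, ranges of the 3n consecutive sets of k measurements)."""
    xs = list(seq)
    k = choose_k(len(xs))
    n = len(xs) // (3 * k)
    if n == 0:
        raise ValueError("too few measurements for this test")
    drop = len(xs) - 3 * k * n              # surplus removed from the middle
    start = (len(xs) - drop) // 2
    kept = xs[:start] + xs[start + drop:]
    sets = [kept[j * k:(j + 1) * k] for j in range(3 * n)]
    return k, n, [max(s) - min(s) for s in sets]


def s3_dispersion(seq, alpha=0.05):
    """Two-tailed S_3 sign test applied to the set ranges (sketch)."""
    k, n, w = set_ranges(seq)
    diffs = [w[i] - w[i + 2 * n] for i in range(n)]   # first third minus last third
    zeros = sum(1 for d in diffs if d == 0)
    r = sum(1 for d in diffs if d > 0) + zeros // 2
    if zeros % 2:
        n -= 1
    cum = sum(comb(n, i) for i in range(r + 1)) / 2 ** n
    reject = cum <= alpha / 2 or cum >= 1 - alpha / 2
    return k, r, n, cum, reject


if __name__ == "__main__":
    data = [0, 1, 0, 2, 0, 0, 99, 99, 0, 0, 0, 5, 0, 7]
    print(set_ranges(data))
    print(s3_dispersion(data))
```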


g. Tables. See 2.

h. Discussion. This test can be made a test for trend in
variance (or standard deviation) simply by substituting this term for
"range" and applying the test as outlined above.

The authors do not suggest the use of the S_2 test to test for
dispersion, although it obviously could be legitimately used for that purpose.

i. Sources. 5, 11.


7. Noether's Sequential Test for Linear Trend

Cox and Stuart's tests for trend in location give specific values
to a constant, C, in a more general test discussed by Noether (23, 24).
The latter author, in effect, sets the null hypothesis that
Pr(X_i > X_{i+C}) = Pr(X_i < X_{i+C}) = 1/2 and examines the optimum value
of C for a sequential probability ratio test of that hypothesis.


8. Noether's Binomial Test for Cyclical Trend

a. Rationale. Suppose that 3n measurements have been
recorded or are available in order of sequence and it is desired to know
whether the sequence may contain a fluctuating or cyclical trend. If
the measurements are continuously distributed and there is no trend
of any kind, no two measurements will be equal, and the measurements
will be randomly related to sequence. Any three consecutive
measurements will be equally likely to have any of the six sequences
represented by the six possible permutations of three things. However, of
these six sequences only two are monotonic, i.e., ascend or descend
without reversals, while the remaining four change direction in the
middle. For example, if the three measurements are ranked, the
ranks will be found to have one of the six sequences: 1 2 3, 3 2 1,
1 3 2, 2 3 1, 2 1 3, 3 1 2, the first two sequences being monotonic.
The probability of monotonicity for such a set of three
measurements is therefore 1/3 if the sequence is random and the
measurements are continuously distributed. And if the 3n measurements
are divided into n independent sets of 3 consecutive measurements
each, the number of monotonic sets will be binomially distributed
with p = 1/3. On the other hand, if a cyclical or fluctuating trend of
any but the shortest possible "wave length" exists, one would expect
more than 1/3 of the sets to be monotonic.


b. Null Hypothesis. For every i ≤ n,
Pr(X_{3i} > X_{3i-1} > X_{3i-2}) + Pr(X_{3i} < X_{3i-1} < X_{3i-2}) = 1/3.
Sufficient conditions for its validity are that the X's are continuously
distributed and the size of the X's is unrelated to their position in
sequence.


c. Assumptions. (1) Pr(X_{3i} = X_{3i-1}) = 0 and
Pr(X_{3i-1} = X_{3i-2}) = 0 for every i ≤ n, i.e., adjacent scores in
no set are tied.

(2) Whether or not any given set is monotonic
is independent of the monotonicity or nonmonotonicity of any other
set. Among other things, this means that no X is used in more than
one set.

(3) The X's are randomly drawn.

d. Treatment of Ties. Ties are a practical problem only
when the tied scores are members of the same set. If the first and
third scores are tied and the second is not, the set is clearly
nonmonotonic and there is no ambiguity. If adjacent members of a set
are tied, the set is as likely as not to be monotonic; therefore, half
of such sets should be counted as monotonic, half as nonmonotonic
(the odd set, when it exists, being discarded and n reduced by 1).
If all three members of a set are tied, the chance probability of
monotonicity is obviously 1/3, and one third of such sets should be
counted as monotonic (one or two sets being discarded and n reduced
accordingly if the number of such sets is not divisible by 3).

e.  Efficiency.  Noether  states  that  he  does  not  believe  the 
test  to  be  highly  efficient. 

f. Application. If the total number of measurements is not
divisible by 3, drop one or two measurements from the middle of the
sequence to make it so. Let 3n stand for the number of
measurements remaining. Divide these 3n measurements into n independent,
i.e., nonoverlapping, sets of 3 consecutive measurements. Count
the number of monotonic sets, treating tied members of a set as
outlined above. Call this number r and call the total number of sets
used n. Then for a one-tailed test at significance level α reject the
null hypothesis if
$\sum_{i=r}^{n} \binom{n}{i} (1/3)^i (2/3)^{n-i} \le \alpha$. This tests
against the one-sided alternative that once a direction is taken it
tends to persevere for a longer than chance period. A two-sided
test would include the alternative that direction fluctuates more
rapidly than would be expected by chance. However, such a contingency
seems unlikely to be of great practical interest, since such a
fluctuation in this case would very nearly amount to alternation of direction,
i.e., change with every measurement.

g. Tables. 34, 25, 28.


h.  Discussion.  This  test  is  also  presented  by  its  author  as 
a  sequential  probability  ratio  test. 
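
For the fixed-sample (non-sequential) form of the test, the counting of monotonic sets and the upper-tail binomial sum can be sketched as follows. The function names are hypothetical. To keep the sketch short it does not implement the tie rules of paragraph d and instead raises an error when adjacent members of a set are tied.

```python
from math import comb


def upper_tail(r, n, p):
    """Pr(at least r successes in n trials with success probability p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(r, n + 1))


def noether_cyclical(seq, alpha=0.05):
    """One-tailed binomial test for cyclical trend (illustrative sketch).

    Returns (r, n, upper-tail probability, reject?).
    """
    xs = list(seq)
    extra = len(xs) % 3
    start = (len(xs) - extra) // 2
    del xs[start:start + extra]          # drop surplus from the middle
    n = len(xs) // 3
    r = 0
    for j in range(n):
        a, b, c = xs[3 * j:3 * j + 3]    # nonoverlapping sets of three
        if a == b or b == c:
            raise ValueError("adjacent ties: apply the tie rules by hand")
        if a < b < c or a > b > c:
            r += 1
    tail = upper_tail(r, n, 1 / 3)
    return r, n, tail, tail <= alpha


if __name__ == "__main__":
    print(noether_cyclical([1, 2, 3, 6, 5, 4, 7, 8, 9, 3, 1, 2]))
```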

Lehmann (17, 38) has briefly proposed a test of the hypothesis
that two populations are identical, which is analogous to Noether's test.
If 2n scores have been drawn from an X population, and 2n from a Y
population, and if X's are paired at random with one another and then
with a pair of likewise paired Y's, there will result n independent
quadruples consisting of two X's and two Y's. If the null hypothesis
is true, and the X's and Y's are continuously distributed, the chance
probability that in a given quadruple both X's will either be greater
than or less than both Y's is 1/3. The number of quadruples for which
this is the case will therefore be binomially distributed with p = 1/3 and
can be used to test the hypothesis of identical populations. The test is
consistent if the sampled populations are continuous, ties are
randomized, and the alternative hypothesis is that p > 1/3.
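
The value 1/3 quoted for the quadruple probability can be checked by brute-force enumeration, and the counting step is equally short. In the Python sketch below the names `same_side`, `null_probability`, and `count_quadruples` are hypothetical, and the scores are assumed to have been paired at random already, so that consecutive entries of each list form a pair.

```python
from itertools import permutations


def same_side(x1, x2, y1, y2):
    """True if both X's are below both Y's or both X's are above both Y's."""
    return max(x1, x2) < min(y1, y2) or min(x1, x2) > max(y1, y2)


def null_probability():
    """Enumerate all 24 equally likely rank assignments to (x1, x2, y1, y2)."""
    orders = list(permutations(range(4)))
    hits = sum(1 for o in orders if same_side(*o))
    return hits, len(orders)


def count_quadruples(xs, ys):
    """Count quadruples whose X's lie on one side of both Y's.

    Assumes len(xs) == len(ys) == 2n and that consecutive entries are the
    randomly formed pairs.
    """
    n = len(xs) // 2
    r = sum(
        1
        for j in range(n)
        if same_side(xs[2 * j], xs[2 * j + 1], ys[2 * j], ys[2 * j + 1])
    )
    return r, n


if __name__ == "__main__":
    print(null_probability())            # 8 of the 24 orderings, i.e. 1/3
    print(count_quadruples([1, 2, 10, 11], [3, 4, 5, 12]))
```

The count r would then be referred to the binomial distribution with p = 1/3, in the same way as the count of monotonic sets in Noether's test.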

i.  Sources.  24. 


9. Mosteller's Test of Predicted Order


a. Rationale. Suppose that n individuals each are to be tested
under k conditions and the experimenter has reason to believe that he
can predict the order of excellence of performance under the k conditions.
If "performance" is continuously distributed so that no two conditions
will result in the same score, then for any one individual there are
k! orders in which the k conditions could be arranged. If performance
is independent of the conditions under which it is tested, then each of
the k! orders is equally likely with probability 1/k!. If performance
is truly unrelated to differences among tested conditions, then the
number of individuals whose order of performance has been correctly
predicted is binomially distributed with p = 1/k!. On the other hand,
if performance is related to conditions and if the experimenter has
correctly predicted the relationship, the frequency of the predicted
order will tend to exceed its chance expectation.

b. Null Hypothesis. Pr(Order Predicted by Experimenter)
= 1/k!. Sufficient conditions for its validity are that measurements
are continuously distributed and unrelated to the specific experimental
conditions under which they occur.

c. Assumptions. (1) None of the performance scores for a
single individual can be tied.

(2) The order of performance excellence
for any given individual is unaffected by that of any other individual.

(3) Individuals and individuals' scores
are randomly selected.

d. Treatment of Ties. Ties are no practical problem unless
one of the possible ways of "breaking" the ties results in the predicted
order. In those cases, for every group of t tied scores, there will
be t! ways of breaking the ties, and if there is more than one such group
for a single individual, the number of ways of breaking the ties will be
the product of these factorials. Therefore, for each individual whose
order of performance contains ties and could be the predicted order
if the ties are broken properly, find the number of ways in which ties
could be broken. Sum these over all such individuals, and call the
total "D". Let N stand for the number of such individuals. Then
N/D is the proportion of these individuals whose order should be
regarded as the predicted one, and (N/D)N or N^2/D individuals should
be counted as having the predicted order. Simpler techniques, which
err in the direction of conservatism, are to regard the N individuals
as not having the predicted order, or to discard the N individuals and
reduce n by N.

e.  Efficiency.  Apparently  unknown. 

f. Application. Treating ties by one of the techniques outlined
above, count the number of individuals whose performance under the k
conditions conforms exactly to the predicted pattern, i.e., whose
performance excellence under each condition has the rank predicted for
performance under that condition. Let this number be r and the total
number of individuals tested be n. Since a smaller than chance
number of individuals having the predicted order is unlikely to be of interest
to the experimenter, only a one-tailed test of the opposite situation will
be outlined. For a one-tailed test at the level α, reject the null
hypothesis in favor of the alternative that the predicted order has a greater
than "chance" probability if
$\sum_{i=r}^{n} \binom{n}{i} (1/k!)^i (1 - 1/k!)^{n-i} \le \alpha$.

g. Tables. 34, 25, 28.
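
A compact way to see the Application procedure in action is the Python sketch below. The name `mosteller_test` and the encoding of the prediction (a sequence of condition indices listed from the lowest expected score to the highest) are assumptions for the example. Rows containing ties that could be broken in favor of the prediction are counted as not conforming, which corresponds to the first of the simpler, conservative techniques mentioned under Treatment of Ties.

```python
from math import comb, factorial


def mosteller_test(scores, predicted, alpha=0.05):
    """One-tailed test of predicted order (illustrative sketch).

    scores    -- one list of k scores per individual
    predicted -- condition indices, lowest expected score first
    Returns (r, n, upper-tail probability, reject?).
    """
    k = len(predicted)
    n = len(scores)
    # An individual conforms only if scores strictly increase along the
    # predicted order; ties therefore never count as conforming.
    r = sum(
        1
        for row in scores
        if all(row[a] < row[b] for a, b in zip(predicted, predicted[1:]))
    )
    p = 1 / factorial(k)
    tail = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(r, n + 1))
    return r, n, tail, tail <= alpha


if __name__ == "__main__":
    data = [[1, 2, 3], [2, 5, 9], [3, 1, 2], [1, 4, 6]]
    print(mosteller_test(data, predicted=(0, 1, 2)))
```

Because 1/k! shrinks rapidly with k, even a handful of exact matches can be significant when k is 4 or 5.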

h. Discussion. It is very important to remember that this
test tests only that if the k conditions affect performance
differentially the experimenter has done a better than chance job of
predicting the pattern. Suppose that of 15 conditions 10 affect performance
in the same way and are therefore equivalent, while the remaining 5
conditions affect performance differentially. If the experimenter
correctly assigns one of the ranks from 1 to 15 to each of the five
differentiating conditions, the predicted rank order will occur more
frequently than 1/15! of the time and the null hypothesis will tend to be
rejected more than α of the time. However, the predicted rank
order will not be correct for the 10 equivalent conditions since it
will imply that they differ, which they do not. Suppose again that
five conditions arranged in order of "excellence" are A B C D E
and that the experimenter has predicted the order A B C E D. If
the conditions differ greatly relative to performance variability, the
experimenter's predicted order may be expected to occur less than
1/5! of the time; while, if performance variability is large relative
to the true differences among conditions, the experimenter's predicted
order may be expected to occur more than 1/5! of the time and the null
hypothesis will tend to be rejected more than α of the time. The
temptation to accept the predicted order as the correct one, when the null
hypothesis is rejected, should therefore be resisted.

i. Sources. 34 (Introduction, pp. xxxvi-xxxvii).


10.  Confidence  Limits  for  Quantiles 


a. Rationale. Assume that a random sample of n independent
observations has been taken from an unknown but continuously distributed
population, and that it is desired to establish confidence limits for the
magnitude of a population quantile, Q. This quantile may be a
percentile, quartile, median, or, more generally, that population magnitude
below which some specified proportion p of the population lies.

Let the n sample observations be arranged in order of increasing
magnitude with subscripts indicating rank position in that order, i.e.,
from smallest to largest the observations are X_1, X_2, X_3, ..., X_r,
..., X_s, ..., X_{n-1}, X_n. Also, let e be an infinitesimally
small positive magnitude. If Q lies at or below X_r + e then r or
fewer sample observations have fallen below the population quantile
Q, the chance probability for which is
$\sum_{i=0}^{r} \binom{n}{i} p^i (1-p)^{n-i}$, where
p is the proportion of units in the population whose magnitude is less
than Q. Likewise, if Q lies at or above X_s - e then n - s + 1 or fewer
sample observations exceed Q, or equivalently, s - 1 or more
observations are smaller than Q. The chance probability for this is

$\sum_{i=s-1}^{n} \binom{n}{i} p^i (1-p)^{n-i}$.

With qualifications which will be outlined under "Assumptions",
these two probabilities may be regarded as the probabilities that Q lies
below X_r + e and that Q lies above X_s - e respectively. If s is larger
than r, the events referred to by these two probabilities are mutually
exclusive (since e is an infinitesimal). Therefore the probability
that Q is neither below X_r + e nor above X_s - e is

$1 - \sum_{i=0}^{r} \binom{n}{i} p^i (1-p)^{n-i} - \sum_{i=s-1}^{n} \binom{n}{i} p^i (1-p)^{n-i}$ or $\sum_{i=r+1}^{s-2} \binom{n}{i} p^i (1-p)^{n-i}$

and this is equivalently the probability that Q lies between X_r + e
and X_s - e. Since e is an infinitesimal, it is also the probability that
X_r < Q < X_s.
b. Assumptions. Random sampling and independent
observations are assumed for reasons given in (1). The assumption of
continuous distribution is required in order to rule out tied
observations. Actually, ties become a practical problem only when they
occur at the critical end points of the confidence region, i.e., when
X_r is tied with X_{r+1} or X_s with X_{s-1}. Such ties render the end points
of the confidence region indistinct and impose an additional (see next
assumption) element of inexactitude upon the calculated confidence
level. If X_r and X_{r+1} are tied, for example, then X_r + e cannot be
greater than X_r and equal to or less than X_{r+1} as required by the
derivation. The tied observations X_r and X_{r+1} represent a third category
of outcomes, e.g., on rather than above or below the median, thus
rendering the binomial an inappropriate mathematical model. The
assumption of a continuous distribution is also required because it
implies an infinite population. If the population is infinite, the
probability of an observation smaller than Q is p for every observation;
if the population is finite, the probability for every observation after
the first depends upon the outcomes of the previous drawings. The
final assumption is incompatible with the immediately previous one.
It is that there is zero probability that the population quantile Q lies
between X_r and X_{r+1} or between X_{s-1} and X_s. The probability
that X_r < Q < X_s was derived to be
$\sum_{i=r+1}^{s-2} \binom{n}{i} p^i (1-p)^{n-i}$;
however, this is precisely the same probability which would have been
obtained for the event X_{r+1} ≤ Q ≤ X_{s-1}. This implies that
Pr(X_r < Q < X_{r+1}) = 0 and that Pr(X_{s-1} < Q < X_s) = 0, which offends
common sense. Phrased differently, the derivation given under
"Rationale" took e to be an infinitesimal, but would have led to the
same results if e had been any positive value such that
X_r < X_r + e ≤ X_{r+1} and X_{s-1} ≤ X_s - e < X_s. Again, this obviously
implies the untenable assumption that Q cannot occupy the region between
X_r and X_{r+1} or between X_{s-1} and X_s. The reason for the discrepancy is
simply that "r observations below Q" and "r+1 observations below Q"
are two "adjacent" eventualities in a discrete distribution of "number
of sample observations below the population quantile Q". Since this
is a discrete distribution of frequencies, there is no event "in between"
the two named. However, "population quantile is X_r" and "population
quantile is X_{r+1}" are nonadjacent eventualities in a continuous
distribution of magnitudes assignable to the population quantile. An error
has therefore been introduced by using a discrete distribution, i.e.,
the binomial, to express probabilities for a continuously distributed
variable. In terms of confidence limits, the error is no larger than
the difference between the confidence limits X_r < Q < X_s and
X_{r+1} ≤ Q ≤ X_{s-1}.


c. Treatment of Ties. If either X_r and X_{r+1} or X_{s-1} and
X_s are tied, it is suggested that the confidence region be changed
(i.e., shifted, expanded or contracted) so as to have untied endpoints.
The conservative, i.e., safest, approach would be to reduce r or enlarge
s to the extent necessary to include within the confidence region all
observations which had been tied with the endpoints. The confidence
level will, of course, have to be recalculated for the new confidence
region determined by the reassigned values of r and s.

d. Application. Let Q be the unknown magnitude of the
population score below which a specified proportion p of the population scores
lie. Draw n sample observations from this population and rank these
observations from smallest (1) to largest (n). Ties should be dealt with
as outlined in the preceding paragraph. Take
$\sum_{i=r+1}^{s-2} \binom{n}{i} p^i (1-p)^{n-i}$
to be the confidence level for the hypothesis that Q lies in one of the
following confidence regions. If the most conservative probability
statement is desired, take X_r < Q < X_s as the confidence region.
However, if greatest accuracy is desired in the sense of minimizing
the error, take the confidence region to be
(X_r + X_{r+1})/2 ≤ Q ≤ (X_{s-1} + X_s)/2.
The former will usually be the more conscionable procedure. The
values p, r, and s must, of course, be selected prior to sampling.

e. Tables. 34, 25, 28. See also 19, 1-8 p. 360.
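
When tables are not at hand, the confidence level and the two candidate regions described under Application can be computed directly. The Python sketch below uses hypothetical function names and assumes that r and s are 1-based ranks chosen before sampling and that the endpoint observations are untied.

```python
from math import comb


def quantile_confidence(n, r, s, p):
    """Sum of binomial terms for i = r+1, ..., s-2 (the level used in the text)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(r + 1, s - 1))


def confidence_regions(xs, r, s):
    """Return the conservative region (X_r, X_s) and the midpoint region.

    r and s are 1-based ranks with 1 <= r < s <= n.
    """
    x = sorted(xs)
    conservative = (x[r - 1], x[s - 1])
    # Midpoints between X_r, X_{r+1} and between X_{s-1}, X_s.
    midpoint = ((x[r - 1] + x[r]) / 2, (x[s - 2] + x[s - 1]) / 2)
    return conservative, midpoint


if __name__ == "__main__":
    sample = [7, 3, 9, 1, 10, 4, 8, 2, 6, 5]
    print(quantile_confidence(n=10, r=2, s=9, p=0.5))
    print(confidence_regions(sample, r=2, s=9))
```

For the median (p = 1/2) with n = 10, r = 2 and s = 9, the sum runs over i = 3, ..., 7 and is attached to the region between the second and ninth ranked observations.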

f. Discussion. The a priori probability that the magnitude
of the r-th ranked observation will be less than Q is not the exact
probability that the magnitude obtained for the r-th ranked observation
will be less than Q. Even in the obtained sample, X_r could be assigned
any magnitude between X_{r-1} and X_{r+1} and still be the r-th observation
in order of magnitude. The range of magnitudes "represented" by X_r,
then, might be considered to be (X_{r-1} + X_r)/2 to (X_r + X_{r+1})/2, i.e., the
point halfway between X_r and the next lower magnitude to the halfway
point to the next higher magnitude. Then if the rank of r represents
magnitudes as high as (X_r + X_{r+1})/2, the summation
$\sum_{i=0}^{r} \binom{n}{i} p^i (1-p)^{n-i}$
would give the probability that Q lies below (X_r + X_{r+1})/2 rather than
below X_r. Obviously, then, the probability that Q is less than or equal
to X_r is no greater than $\sum_{i=0}^{r} \binom{n}{i} p^i (1-p)^{n-i}$.
Therefore, we can be confident at least at the level
$\sum_{i=r+1}^{s-2} \binom{n}{i} p^i (1-p)^{n-i}$ that X_r < Q < X_s. By
introducing an inequality then we can make a definitive probability
statement which takes account of the error discussed under the last
"assumption". It is that
$\Pr(X_r < Q < X_s) \ge \sum_{i=r+1}^{s-2} \binom{n}{i} p^i (1-p)^{n-i}$. Also,
if instead of the most conservative probability statement, we wish to make
the most nearly accurate one, we can take
(X_r + X_{r+1})/2 ≤ Q ≤ (X_{s-1} + X_s)/2
as the most probably "true" confidence interval corresponding to the
confidence level $\sum_{i=r+1}^{s-2} \binom{n}{i} p^i (1-p)^{n-i}$.

If Q is taken to be the population median, the confidence level
becomes simply $\sum_{i=r+1}^{s-2} \binom{n}{i} (1/2)^n$.

It is important to note that the "error" implicit in this method appears only when setting confidence limits for the unknown magnitude of a specified quantile, Q. If the magnitude of Q is hypothesized to be a single specified value, Q', then an exact test of the hypothesis Q = Q' can be made by rejecting if Q' lies outside of the confidence limits $X_r < Q < X_s$.

The  methods  just  discussed  establish  confidence  limits  for 
the  unknown  magnitude  or  score  below  which  a  fixed  proportion  of  the 
population  lies.  Binomial  methods  have  also  been  suggested  (3,  6,  31) 
by  which  to  obtain  confidence  limits  for  an  unknown  population  propor¬ 
tion  on  the  basis  of  the  proportion  of  an  obtained  sample  corresponding 
to  a  specified  category.  These  methods,  however,  appear  to  be  cumber¬ 
some,  inexact,  or  both. 


g. Sources. 19, 32, 40. (See also 3, 6, 9, 21, 22, 31, 33, and 1-8 pp. 320-323, 360.)
CHAPTER III


THE  MULTINOMIAL  DISTRIBUTION 

The  multinomial  distribution  is  important  in  a  study  of  distribution- 
free  tests  because  it  plays  a  role  in  the  derivation  of  a  number  of  exact 
tests.  It  is  also  the  exact  distribution  appropriate  to,  but  too  compli¬ 
cated  for,  the  type  of  test  situation  in  which  the  chi  square  statistic 
is  commonly  used.  Chi  square  is  in  fact  derived  from  the  multino¬ 
mial  by  means  of  a  series  of  approximations,  tantamount  to  assump¬ 
tions,  which  render  chi  square  inexact  when  sample  size  is  not  in¬ 
finite,  and  which  necessitate  considerable  skill  in  applying  it  properly. 
1. Derivation and Assumptions


a. Derivation. Let an event have k possible outcomes, designated by subscripts 1, 2, ..., k, and let these outcomes be mutually exclusive and independent and have probabilities $p_1, p_2, \ldots, p_k$ such that $\sum_{i=1}^{k} p_i = 1$. If the event is allowed to occur n times, the probability that the respective frequencies of occurrence of the various outcomes will be exactly $n_1, n_2, \ldots, n_k$ is

$$\frac{n!}{n_1!\,n_2!\cdots n_k!}\,p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}, \quad\text{or}\quad n!\prod_{i=1}^{k}\frac{p_i^{n_i}}{n_i!}.$$


Proof: The probability that the outcomes will occur exactly $n_1, n_2, \ldots, n_k$ times respectively and in a completely specified order (for example, the order in which the first $n_1$ outcomes are those whose probability is $p_1$, the next $n_2$, those whose probability is $p_2$, etc.) is $p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$. To obtain the probability for these frequencies, but in any order, the preceding product must be multiplied by the number of distinguishable orders. The n outcomes can be permuted in $n!$ ways. But in any one of these permutations, there are $n_1$ outcomes of the first category which are the same and which can be permuted among themselves in $n_1!$ ways without changing the appearance of the order. And for each of these $n_1!$ permutations, the outcomes of the second category can be permuted with one another in $n_2!$ ways without changing the appearance of the original order, etc. There are thus $n_1!\,n_2!\cdots n_k!$ ways in which each distinguishable order pattern can be permuted without creating a pattern distinguishable from it. Since $n!$ is the number of distinguishable patterns times $n_1!\,n_2!\cdots n_k!$, the number of distinguishable patterns of order is

$$\frac{n!}{n_1!\,n_2!\cdots n_k!}$$

and the probability that in n trials the k categories of outcomes will occur $n_1, n_2, \ldots, n_k$ times respectively is

$$\frac{n!}{n_1!\,n_2!\cdots n_k!}\,p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}.$$
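The two factors in the derivation, the number of distinguishable orders and the probability of any one specified order, translate directly into code. The following Python sketch is illustrative only; the function name `multinomial_pmf` and the example frequencies are assumptions, not part of the report.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Point probability of the frequencies `counts` under probabilities `probs`."""
    if len(counts) != len(probs):
        raise ValueError("one probability is required per category")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("category probabilities must sum to 1")
    n = sum(counts)
    # Number of distinguishable orders: n! / (n_1! n_2! ... n_k!).
    orders = factorial(n)
    for c in counts:
        orders //= factorial(c)
    # Probability of any one completely specified order.
    one_order = prod(p**c for p, c in zip(probs, counts))
    return orders * one_order


# Illustrative example: four trials, three categories.
print(multinomial_pmf((1, 2, 1), (0.25, 0.5, 0.25)))  # 0.1875
```

The integer division is exact at every step because each partial quotient is itself a count of arrangements.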


b. Assumptions. Since, in the derivation, the same value, $p_i$, was taken as the probability for outcome i in each of its $n_i$ occurrences, $p_i$ must not vary from event to event. The outcome of a given event must therefore be independent of the outcomes of any of the n-1 other events. Not only must the probabilities of the various possible outcomes of an immediate event be independent of the actual, observed, outcomes of the previous events, they must also be mutually exclusive. This assumption is necessitated by the fact that the probability of a given set of n outcomes was obtained in the derivation by taking the product of the n individual outcome probabilities; to obtain compound probability in this fashion, the individual probabilities must be mutually exclusive. (See Mood I, 30-36). Another assumption is that $\sum_{i=1}^{k} n_i = n$. Unless this is the case, $\frac{n!}{n_1!\,n_2!\cdots n_k!}$ does not give the number of distinguishable orders of obtained outcomes as required in the derivation, and, in fact, becomes meaningless. Since k mutually exclusive outcomes are recognized as possible, $\sum_{i=1}^{k} p_i$ must equal 1. Otherwise a real probability, $1 - \sum_{i=1}^{k} p_i$, would exist for outcomes in an additional category or categories not considered. (Furthermore, the occurrence of such uncategorized outcomes would mean that n would be greater than $\sum_{i=1}^{k} n_i$.) Finally, since $p_1$, $p_2$, etc. are chance probabilities, sampling must be random, i.e., the n events or trials must be selected on a chance basis from the infinite number of potential events available. Specifically this means, among other things, that no bias shall have operated to exclude valid but "unfavorable" data from the test.


Use of the multinomial distribution in statistical tests requires that the probabilities for all of the possible outcomes be known exactly and be included in the formula $n!\prod_{i=1}^{k}\frac{p_i^{n_i}}{n_i!}$. It is important, however, to recognize that the experimenter is free to define both the sample space
in  which  he  is  interested  and  the  categories  which  divide  that  sample 
space  into  k  mutually  exclusive  parts.  The  experimenter  must,  in 
fact,  be  careful  to  do  this  in  such  a  way  as  to  define  precisely  that  situa¬ 
tion  in  which  he  is  interested.  If  he  fails  he  will  obtain  an  exact  prob¬ 
ability  for  a  situation  in  which  he  is  not  interested,  and  this  probability 
will  differ,  perhaps  considerably,  from  the  exact  probability  for  the 
situation in which his interest lies. For example, in coin tossing, in addition to "heads" and "tails", the outcome category "on the rim" has finite probability which usually cannot be specified. Therefore, although heads and tails have equal probabilities, these probabilities are unknown since their sum is not 1. By defining his sample space as that including only those outcome categories in which the coin lands flat, the experimenter enables himself to specify as .50 the probability of heads and the probability of tails. The experimenter is no better off, however, unless his interest is confined to the sample space consisting only of heads and tails, i.e., is confined to the frequency of heads relative to tails rather than tosses. Again, the experimenter
may be interested in broader categories than those into which his data are fitted. In such cases he should use the categories in which he is interested rather than those in which the data are available. For example, in tossing two coins simultaneously the possible outcomes will be defined to be two heads ($p_1 = 1/4$), a head and a tail ($p_2 = 1/2$), and two tails ($p_3 = 1/4$). Suppose that the two coins have been simultaneously tossed n times and that the frequencies of the respective outcomes named above are $n_1$, $n_2$ and $n_3$. If the experimenter is interested in the point probability of the obtained frequencies for the outcomes stated, the proper formula is

$$\frac{n!}{n_1!\,n_2!\,n_3!}\,(1/4)^{n_1}(1/2)^{n_2}(1/4)^{n_3}.$$

On the other hand, if he is interested in the probability of the obtained frequencies for the recategorized outcomes, "coins have same side up" ($p = 1/2$) and "coins have different sides up" ($p = 1/2$), the obtained frequencies are $n_1+n_3$ and $n_2$ respectively and the probability is

$$\frac{n!}{(n_1+n_3)!\,n_2!}\,(1/2)^{n_1+n_3}(1/2)^{n_2}.$$

The probabilities for the same data under the two different categorizations of outcome are not the same:

$$\frac{n!}{n_1!\,n_2!\,n_3!}\,(1/4)^{n_1+n_3}(1/2)^{n_2} \;\ne\; \frac{n!}{(n_1+n_3)!\,n_2!}\,(1/2)^{n_1+n_3}(1/2)^{n_2}.$$

Cancelling common factors, equality would require

$$\frac{(n_1+n_3)!}{n_1!\,n_3!} \overset{?}{=} 2^{\,n_1+n_3}.$$

Substituting N for $n_1+n_3$, the questioned equality becomes $\binom{N}{n_1} \overset{?}{=} 2^N$, which is obviously absurd since $\binom{N}{n_1}$ varies with the particular values of $n_1$ and $n_3$, while $2^N$ does not, varying only with their sum. The reason for the discrepancy between the two probabilities is that one states merely that $n_1+n_3$ tosses result in either two heads or two tails without specifying precisely how many of these shall be two heads; the other probability does specify this further and much more restrictive information. The latter probability is, therefore, much smaller than the former.
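The two-coin comparison is easy to check numerically. The sketch below is illustrative only: the helper `multinomial_pmf` and the frequencies 3, 5 and 2 are assumptions chosen for the demonstration, not values from the report.

```python
from math import comb, factorial, prod


def multinomial_pmf(counts, probs):
    """Point probability of the frequencies `counts` under probabilities `probs`."""
    orders = factorial(sum(counts))
    for c in counts:
        orders //= factorial(c)
    return orders * prod(p**c for p, c in zip(probs, counts))


# Hypothetical frequencies: two heads, head and tail, two tails.
n1, n2, n3 = 3, 5, 2

# Three-category probability (two heads / mixed / two tails).
fine = multinomial_pmf((n1, n2, n3), (0.25, 0.5, 0.25))

# Two-category probability (same side up / different sides up).
coarse = multinomial_pmf((n1 + n3, n2), (0.5, 0.5))

# The ratio of the two is 2^N / C(N, n1) with N = n1 + n3.
ratio = coarse / fine
expected_ratio = 2 ** (n1 + n3) / comb(n1 + n3, n1)
print(fine, coarse, ratio, expected_ratio)
```

For these frequencies the coarser categorization is 3.2 times as probable as the finer one, in agreement with $2^N / \binom{N}{n_1} = 32/10$.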


The  multinomial  distribution  is  seldom  used  directly  as  the 
basis  of  a  statistical  test.  This  is  partly  attributable  to  the  fact  that 
the  exact  probabilities  for  the  various  outcome  categories,  although  re¬ 
quired  by  the  test,  are  seldom  known;  and  it  is  partly  because,  unless 
n is quite small, computation of cumulative probabilities, i.e., significance levels, is likely to be extremely time consuming. Nor is this distribution extensively tabled except for the special case where k=2, i.e., except for the case of the binomial distribution. The reason for the lack of extensive tables is obvious: the number, 2k - 1, of required parameters is prohibitively large.
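For small n the computation can nevertheless be carried out by brute-force enumeration. The following Python sketch is one way to do so; it assumes, as a convention not stated in the report, that the significance level is the total probability of all outcomes no more probable than the one observed. The function names are hypothetical.

```python
from math import factorial, prod


def compositions(n, k):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def multinomial_pmf(counts, probs):
    """Point probability of the frequencies `counts` under probabilities `probs`."""
    orders = factorial(sum(counts))
    for c in counts:
        orders //= factorial(c)
    return orders * prod(p**c for p, c in zip(probs, counts))


def exact_multinomial_level(observed, probs):
    """Sum the probabilities of all outcomes no more probable than `observed`.

    Assumed convention: 'as extreme' means 'no more probable'.
    """
    n, k = sum(observed), len(probs)
    p_obs = multinomial_pmf(observed, probs)
    total = 0.0
    for counts in compositions(n, k):
        q = multinomial_pmf(counts, probs)
        if q <= p_obs * (1 + 1e-12):  # small tolerance for rounding
            total += q
    return total


# Illustrative data: ten trials, three equally likely categories.
print(exact_multinomial_level((10, 0, 0), (1 / 3, 1 / 3, 1 / 3)))
```

The number of frequency patterns to enumerate is $\binom{n+k-1}{k-1}$, which is 66 for n = 10 and k = 3 but grows rapidly, illustrating why the exact computation was regarded as time consuming.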


2.  The  Chi  Square  Approximation  to  the  Multinomial 

Because  chi-square  occupies  a  prominent  position  in  most 
elementary  statistical  texts  it  will  be  assumed  that  the  details  of  its 
application  are  familiar  to  the  reader.  Because  it  is  one  of  the  most 
misunderstood and misused of statistical tests, its theory and the hazards
of  its  misapplication  will  be  discussed  in  detail. 

The chi-square distribution is derived from the multinomial, three approximations being required in the derivation and therefore qualifying the use of chi-square. The first approximation consists in replacing the factorials in the multinomial

$$\frac{n!}{n_1!\,n_2!\cdots n_k!}\,p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$$

by their Stirling approximations. The second approximation "is similar in character to the familiar one by which an expression of the form $(1+x/m)^m$ is replaced by $e^x$ when m is large" (27). The final approximation consists in replacing by an integral the discrete summation representing the cumulative distribution function.

Each of these three approximations presupposes infinite, i.e. very large, n's and becomes increasingly poor with diminishing sample size. Each is strictly valid only for samples of infinite size.
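The dependence of the first two approximations on sample size can be seen numerically. The sketch below uses the common leading-term form of Stirling's formula, $n! \approx \sqrt{2\pi n}\,(n/e)^n$; the report does not say which form of the formula is used in the derivation, so this choice and the function names are assumptions.

```python
from math import e, factorial, pi, sqrt


def stirling(n):
    """Leading-term Stirling approximation to n! (assumed form)."""
    return sqrt(2 * pi * n) * (n / e) ** n


def stirling_relative_error(n):
    """Relative error of the Stirling approximation to n!."""
    exact = factorial(n)
    return abs(stirling(n) - exact) / exact


def compound_limit_error(x, m):
    """Absolute error made when (1 + x/m)^m is replaced by e^x."""
    return abs((1 + x / m) ** m - e**x)


for n in (2, 5, 20, 50):
    print(n, stirling_relative_error(n))

for m in (10, 100, 1000):
    print(m, compound_limit_error(1.0, m))
```

Both errors shrink steadily as n (or m) grows but vanish only in the limit, which is the sense in which each approximation is "strictly valid only for samples of infinite size".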


The  first  two  approximations  together  are  equivalent  to  sub¬ 
stituting  for  the  multinomial  distribution  its  multivariate  normal 
approximation.  At  this  point  the  assumption  is  necessitated  that,  for 
each  category,  the  observed  frequencies  are  normally  distributed 
about  the  expected  frequency  as  a  mean.  For  a  single  multinomial 
category,  outcomes  are  binomially  distributed;  therefore  replacing 
the  multinomial  distribution  by  its  multivariate  normal  approximation 
is  equivalent  to  substituting  the  univariate  normal  distribution  for  the 
true  binomial  distribution  of  outcomes  within  each  multinomial  cate¬ 
gory.  In  fact,  the  working  formula  by  which  data  are  referred  to  the 
chi  square  tables  can,  for  the  case  of  one  degree  of  freedom,  be  easily 
derived  by  making  this  substitution.  Consider  a  binomial  variate 
with  the  probability  p  for  a  single  event.  The  point  probability  that 


it will occur r times in n trials is $\frac{n!}{r!\,(n-r)!}\,p^r(1-p)^{n-r}$, or, if the normal approximation is used, the corresponding cumulative probability is that of the "normal" deviate $x = \frac{r-np}{\sqrt{np(1-p)}}$, np being the mean and $\sqrt{np(1-p)}$ the standard deviation of the binomial distribution. If both sides of the equation are squared and numerator and denominator of the right side are multiplied by n, it becomes

$$x^2 = \frac{n(r-np)^2}{np(n-np)}.$$

Now substitute $f_{o_1}$ for r and $f_{e_1}$ for np, giving

$$x^2 = \frac{n(f_{o_1}-f_{e_1})^2}{f_{e_1}(n-f_{e_1})} = \frac{(f_{o_1}-f_{e_1})^2}{f_{e_1}} + \frac{(f_{o_1}-f_{e_1})^2}{n-f_{e_1}} = \frac{(f_{o_1}-f_{e_1})^2}{f_{e_1}} + \frac{(n-f_{o_1}-n+f_{e_1})^2}{n-f_{e_1}}.$$

If now $f_{e_2}$ is substituted for $n-f_{e_1}$ and $f_{o_2}$ for $n-f_{o_1}$,

$$x^2 = \frac{(f_{o_1}-f_{e_1})^2}{f_{e_1}} + \frac{(f_{o_2}-f_{e_2})^2}{f_{e_2}},$$

which is the formula used to calculate $\chi^2$ with one degree of freedom from data in which $f_{o_1}$ and $f_{e_1}$ are the observed and expected frequencies of occurrence and $f_{o_2}$ and $f_{e_2}$ are the corresponding frequencies of nonoccurrence. (It is easily seen from the foregoing that chi is normally distributed when chi square is based on a single degree of freedom.)
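The algebraic identity just derived, that the squared normal deviate equals the two-cell chi-square working formula, can be verified numerically. The Python sketch below is illustrative; the function names and the sample values are assumptions.

```python
def squared_normal_deviate(r, n, p):
    """Square of (r - np) / sqrt(np(1 - p))."""
    return (r - n * p) ** 2 / (n * p * (1 - p))


def chi_square_two_cells(r, n, p):
    """Sum of (observed - expected)^2 / expected over occurrence and nonoccurrence."""
    observed = (r, n - r)
    expected = (n * p, n * (1 - p))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 occurrences in 100 trials with p = 0.5.
print(squared_normal_deviate(60, 100, 0.5))  # 4.0
print(chi_square_two_cells(60, 100, 0.5))    # 4.0
```

The agreement holds for any r, n and p, since the two expressions are the same quantity written in different forms.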


The assumption that observed frequencies are normally distributed about their expected frequency is, of course, incapable of being met exactly unless n is infinite at which point the binomial distribution and its asymptotic normal "approximation" are identical. The normality assumption is therefore equivalent to the "assumption" that n is infinite, or, since the expected frequency, $f_{e_i}$, equals $np_i$, that all expected frequencies are infinite. In more practical terms, the "assumption" of normal distribution of observed frequencies will be negligibly violated if the following conditions exist: (a) n is so large that for every $p_i \ne .50$, the true, i.e. binomial, distribution of observed frequencies within each category has no more than negligible asymmetry; this must be the case if the binomial is to be well approximated by the "fitted" normal distribution which is symmetrical, (b) n is so large that for each category the area of the "fitted" normal curve covering impossible "observed" frequencies, i.e., those frequencies which are less than zero or greater than n, is negligible relative to the size α of the significance level being used for the chi square test, (c) n is so large that if for each category the points corresponding to observed frequencies in the binomial distribution of observed frequencies were connected by line segments, the result would have the appearance of a smoothly continuous curve. The smaller the smallest $p_i$ is, the larger n must be to produce the effects named; and the smaller the significance level chosen for the chi square test, the greater the relative importance of asymmetry, the alleged probability of impossible frequencies, and discontinuity, and therefore the larger n must be to make these effects negligible. The term "negligible" has not been, and will not be defined. Any subjective definition will suffice if consistently applied, since, in the above discussion, that degree of cause which is defined as negligible will have an effect whose degree is of about the same order of negligibility.
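Conditions (a) and (b) can be given rough numerical form. In the Python sketch below, the asymmetry of a category's binomial distribution is measured by its skewness, $(1-2p)/\sqrt{np(1-p)}$, and the "impossible" area is the mass that the fitted normal curve places below zero or above n. Both measures, the function names, and the example values are illustrative assumptions rather than the report's own procedure.

```python
from math import sqrt
from statistics import NormalDist


def binomial_skewness(n, p):
    """Skewness of the binomial distribution of a category's frequency."""
    return (1 - 2 * p) / sqrt(n * p * (1 - p))


def impossible_area(n, p):
    """Area of the fitted normal curve lying below 0 or above n."""
    fitted = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return fitted.cdf(0) + (1 - fitted.cdf(n))


# Two hypothetical categories, each with expected frequency np = 5.
for n, p in ((10, 0.5), (50, 0.1)):
    print(n, p, binomial_skewness(n, p), impossible_area(n, p))
```

With the same expected frequency of 5, the category with p = .10 is both more skewed and carries more "impossible" area than the one with p = .50, which illustrates why expected frequency alone does not settle how good the normal approximation is.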


Much  acrimonious  controversy  has  raged  over  the  question  of 
how  small  an  expected  frequency  can  be  safely  used  in  a  chi-square 
test.  The  reason  for  the  animosity  is  not  hard  to  find.  Since  for  any 
expected  frequency  short  of  infinity,  chi  square  is  an  approximation 
rather  than  an  exact  test,  the  question  of  how  small  an  expected  fre¬ 
quency  can  be  tolerated  resolves  itself  into  a  pure  matter  of  opinion 
as to how close an approximation is "good". And most writers have not quantified the degree of approximation which they find tolerable other than by specifying a minimum acceptable expected frequency. The most popular rule of thumb appears to be that "no expected frequency should be less than 5", possibly because the normal approximation to the binomial is regarded as good if np exceeds 5. However, such rules overlook the fact that the effect of an assumption violation is usually a function of several factors only one of which, i.e., expected frequency, is mentioned in the rule. For example, there is every reason to believe a priori that (a) the variance and degree of symmetry of the sampling distribution of observed frequencies, (b) the "height" of the significance level chosen, and (c) the number of categories, will be important factors in determining whether or not the use of an expected frequency as low as 5 will have an appreciable effect upon the closeness of approximation of the chi square significance level to the "true" multinomial significance level. The smaller the variance of the true sampling distribution of observed frequencies the smaller will be the area of the normal distribution, assumed for them, which occupies the region corresponding to negative, and therefore impossible, frequencies. And the more nearly symmetrical the sampling distribution of observed frequencies (i.e., the closer p is to .50 for a given n), the better it will be approximated by the normal distribution it is "assumed" to have. Curve "fits" are usually poorest at their tails; therefore the distortion of the chi-square approximation should be greater the higher the significance level. Finally, since chi-square is the sum of squared deviations divided by the respective expected frequencies, the effect of a single very small expected frequency in a large number of categories would exert a smaller relative influence upon the sum, and therefore chi-square, than would be the case if a smaller number of categories were being used. Tables III and IV show the distorting effect of some of these factors upon chi square probabilities when the expected frequency is 5 and 2 respectively. For other studies of the sensitivity of chi square to gross violations of its assumptions, see (9, 36, 56, 59, 66).

The  prohibition  against  small  expected  frequencies  has  led  to 
the  widely  accepted  practice  of  pooling  categories  in  order  to  bring  the 
expected  frequencies  for  the  combined  categories  up  to  the  required  size. 
Such  pooling,  however,  involves  an  arbitrary  decision  which  must  usually 
be made subsequent to the collection of data. Such a posteriori manipulation of test parameters, i.e., categories, in effect violates the assumption of random sampling since outcomes are being influenced by a factor other than chance. This objection is not an academic one, since the manner in which categories are combined can dramatically affect the significance levels obtained for a given set of data. Gumbel (29) gives an example of a goodness of fit test in which probability levels calculated by chi-square from the same data, using the same abscissa interval length to define categories (and of course the same number of categories in each case), vary by a factor of 30 depending on the point chosen for the beginning of the first interval. When dealing with contingency tables the expected frequencies are usually not known in advance of sampling, being calculated from the marginal observed frequencies. In such cases the experimenter may be forced to choose between a posteriori pooling and using too small an expected frequency, assumptions being violated under either alternative. However, in testing goodness of fit to a completely known and tabled continuous function the issue can be avoided because sufficient information is available to set, in advance, the minimum expected frequency which the experimenter is willing to tolerate. The "X-axis" of the distribution to which fit is being tested is divided into k intervals so that the area under the curve above each interval is the same for every interval, each such area therefore equaling 1/k. Each interval therefore is a category whose probability is 1/k, and if n observations are taken, the expected frequency for each category is n/k. (See 42 and 63.)
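The equal-probability construction can be sketched in a few lines. The Python example below assumes, purely for illustration, that the completely specified distribution is the standard normal; the function names and the data are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def equal_probability_cutpoints(dist, k):
    """Return the k-1 points dividing `dist` into k intervals of area 1/k."""
    if k < 2:
        raise ValueError("at least two intervals are required")
    return [dist.inv_cdf(j / k) for j in range(1, k)]


def cell_counts(data, cutpoints):
    """Count the observations falling in each of the len(cutpoints)+1 intervals."""
    counts = [0] * (len(cutpoints) + 1)
    for x in data:
        counts[bisect_right(cutpoints, x)] += 1
    return counts


def chi_square_equal_cells(data, dist, k):
    """Chi-square working formula with every expected frequency equal to n/k."""
    observed = cell_counts(data, equal_probability_cutpoints(dist, k))
    expected = len(data) / k
    return sum((o - expected) ** 2 / expected for o in observed)


# The intervals are fixed before any data are examined.
cuts = equal_probability_cutpoints(NormalDist(), 4)
print(cuts)  # quartiles of the standard normal

sample = [-1.2, -0.3, 0.1, 0.4, 0.9, 1.5, -0.8, 0.2]  # hypothetical data
print(cell_counts(sample, cuts), chi_square_equal_cells(sample, NormalDist(), 4))
```

Because k is chosen before sampling, the experimenter sets the expected frequency n/k in advance and no a posteriori pooling is needed.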
[Table: Influence of Sample Size, n, Category Probability, p, and Approximate ...]
The last of the three approximations used in the derivation of the chi square density function consisted of replacing a discrete sum by an integral. The result is that the tabled chi square distribution is continuous while the multinomial distribution which it approximates is discretely distributed as is the "working formula", $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, by which "chi square" is calculated from obtained data. Substituting an integral for a discrete summation is conscionable only when the discrete function involves so many discrete values, each differing so slightly from the adjacent values, as to be well approximated by a continuous function. When expected frequencies are small, the number of different values which the observed frequencies can assume is quite limited, and this discrete distribution is not well approximated by the continuous chi-square distribution. However, when chi square is based upon a single degree of freedom, the approximation can generally be improved by applying Yates' correction for continuity (66). This consists of reducing the absolute value of the deviations of observed from expected frequencies by 1/2 prior to squaring them in the calculation of chi square. The correction does not compensate exactly for the discontinuity in the sampling distribution from which the obtained data were "drawn"; it may, in fact, aggravate rather than reduce the error. "In symmetrical and nearly symmetrical distributions" ... the correction overestimates the true ... "probabilities at both tails and under-estimates them near the centre of the distribution. Such discrepancies, however, are small compared with those arising in violently unsymmetrical cases." (66) Generally Yates' correction is an improvement. It is commonly recommended for calculating chi squares based on one degree of freedom (except when $|f_o - f_e| < 1/2$, in which case it "overcorrects"). It should not be used, however, in calculating individual chi squares, with one degree of freedom, which are to be added, and their degrees of freedom summed, to obtain a total chi-square. (See Chapter IV for a superior method in the case of certain fourfold contingency tables.)
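The effect of Yates' correction can be illustrated by comparing chi-square probabilities with the exact binomial probability for a two-category case. In the Python sketch below the exact two-tailed probability is taken as the total probability of all frequencies deviating from np at least as much as the observed one, and a deviation smaller than 1/2 is simply set to zero when corrected; both conventions, the function names, and the data are assumptions made for the example.

```python
from math import comb, erfc, sqrt


def chi2_upper_tail_1df(x):
    """P(chi-square with 1 df exceeds x), i.e. P(|Z| > sqrt(x)) for normal Z."""
    return erfc(sqrt(x / 2))


def chi_square_1df(r, n, p, yates=False):
    """Two-cell chi square, optionally with Yates' continuity correction."""
    observed = (r, n - r)
    expected = (n * p, n * (1 - p))
    total = 0.0
    for o, e in zip(observed, expected):
        deviation = abs(o - e)
        if yates:
            deviation = max(deviation - 0.5, 0.0)  # assumed handling below 1/2
        total += deviation**2 / e
    return total


def exact_two_tailed(r, n, p):
    """Exact binomial probability of a deviation from np at least as large."""
    observed_dev = abs(r - n * p)
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(n + 1)
        if abs(i - n * p) >= observed_dev - 1e-12
    )


# Hypothetical data: 15 occurrences in 20 trials with p = 0.5.
exact = exact_two_tailed(15, 20, 0.5)
uncorrected = chi2_upper_tail_1df(chi_square_1df(15, 20, 0.5))
corrected = chi2_upper_tail_1df(chi_square_1df(15, 20, 0.5, yates=True))
print(exact, uncorrected, corrected)
```

In this symmetrical example the corrected probability lies much nearer the exact value than the uncorrected one, and it slightly overestimates the exact tail probability, as the quoted passage from (66) leads one to expect.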


Since  the  multinomial  distribution  from  which  chi-square  is 
derived  applies  only  to  repeated  independent  events  the  chi-square  test 
is  equally  dependent  upon  the  assumption  that  each  of  the  occurrences 
of  an  event  comprising  a  frequency  of  occurrence  is  independent  of  all 
other  occurrences  of  the  event.  This  is  one  of  the  most  frequently 
violated  assumptions  of  chi  square  (40).  Also  traceable  to  the  multi¬ 
nomial  is  the  assumption  that  outcome  categories  are  mutually  exclu¬ 
sive. 
If $p_i$ is the probability that a single event will have an outcome which places it in the i-th category, then the expected frequency, $f_{e_i}$, for that category is $np_i$, where n is the number of times the event is permitted to occur. Since $p_i = f_{e_i}/n$ and since $\sum p_i = 1$, it follows that $\sum f_{e_i} = n$. Obviously if $f_{o_i}$ is the observed frequency for category i, then $\sum f_{o_i} = n$, or $\sum f_{o_i} = \sum f_{e_i}$. This is frequently stated as an assumption: "the sum of the observed frequencies must equal the sum of the expected frequencies". It is probably most frequently violated by failing to give the $f_{e_i}$ the exact decimal values calculated for them, rounding them off instead to whole numbers.

Another assumption is that the introduction of information concerning higher moments, such as the variance of the distribution in a test of fit, does not alter the condition expressed by the equality $\sum (f_{o_i} - np_i) = 0$. This is expressed in the requirement that necessary equations involving the above equality are linear and homogeneous in the variables $(f_{o_i} - np_i)$. (27)

When useful information can be introduced into the chi-square test, such as the variance of a distribution whose "goodness of fit" is being tested, the effect is to identify and specify the particular values which the chi variate may assume in one of the dimensions of the hyperspace which the chi distribution occupies. The effect of each such restriction is to reduce by one the number of dimensions in which chi is "free" to vary. The number of such free dimensions is known as the number of degrees of freedom. Fisher (23) presents the rationale for this reduction as follows. "The common sense of this correction lies in the fact that when the population with which the sample is compared has been artificially identified with the sample in certain respects, such as marginal frequencies, or the moments, we shall evidently make an exaggerated estimate of the closeness of agreement between sample and population, if we regard the sample as an unselected sample of a population known a priori."

Chi  square,  although  deceptively  simple  in  application,  is 
one  of  the  most  complicated  statistics  in  its  theoretical  basis.  It 
has  been  widely  misunderstood  by  professional  statisticians  as  well 
as  by  laymen.  Nearly  a  quarter  of  a  century  elapsed  after  Pearson's 
publication  of  the  original  article  on  chi  square  before  statisticians 
understood  how  degrees  of  freedom  are  affected  by  linear  restrictions 
upon the data. And in a survey (40) of the use of chi square by psychologists publishing in a professional journal, in nine out of fourteen articles the application of chi square was found to be "clearly unwarranted". As a symptom of the confusion surrounding its use, extended discussion and debate have surrounded such questions as the correct number of degrees of freedom (8, 20, 22, 23, 24, 39, 40, 48, 67), the minimum tolerable expected frequency (8, 19, 39, 40), when and how to apply Yates' correction (1, 8, 11, 40, 66), and even whether or not the hypothesis of "fit" should be rejected when the fit is so good as to be expected rarely (2, 6, 8, 58). (Curiously enough the affirmative in the last named controversy was taken by no less a statistician than R. A. Fisher; it is effectively and eloquently rebutted by Stuart (58).)


Aside from its complexity, chi-square suffers from a number of
practical and theoretical shortcomings. Whether or not a hypothesis
of fit will be rejected may depend as much upon the statistician as
upon the obtained data, since probabilities may be greatly affected,
a posteriori, by the manner in which the data are grouped into
"intervals" or cells. Since all deviations are squared in the
computation of chi-square, the test is completely insensitive to the
directions of the deviations, regarding a series of unidirectional
deviations as no more significant than a set of deviations varying
haphazardly in direction from the hypothesized curve but having the
same absolute magnitudes. Applications of chi-square in which, for a
given sample size, all expected frequencies can be specified in
advance of sampling are relatively rare. However, it is only in such
cases that the chi-square test is truly parameter-free. In all other
cases chi square is parametric in the sense that population
parameters, e.g., expected frequencies in a contingency table or the
variance of a "fitted" distribution, must be estimated a posteriori
from sample data. And in such applications the excellence of the
test, i.e., the accuracy of its calculated probabilities, depends
upon the efficiency of the estimates and upon their accuracy in the
particular case in question. In the sense that chi-square assumes
the sampling distribution of observed frequencies in each category to
be normally distributed, it is not "distribution-free". More
accurately phrased, chi square falsely assumes a multivariate normal
distribution in cases where the true distribution must necessarily be
the multinomial. Because of its resort to such approximations, it is
an inexact test.
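
The insensitivity to direction noted above can be illustrated with a
minimal sketch. The frequencies are hypothetical and chosen only for
illustration; the function name chi_square is likewise an assumption,
not part of the original text.

```python
def chi_square(observed, expected):
    """Pearson's statistic: sum over cells of (f_o - f_e)^2 / f_e."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Hypothetical expected frequencies for six ordered class intervals.
expected = [20, 20, 20, 20, 20, 20]

# Deviations +4, +4, +4, -4, -4, -4: a systematic, unidirectional pattern
# (an excess in the lower intervals, a deficit in the upper ones).
systematic = [24, 24, 24, 16, 16, 16]

# The same absolute deviations, alternating in sign from cell to cell.
haphazard = [24, 16, 24, 16, 24, 16]

x2_systematic = chi_square(systematic, expected)
x2_haphazard = chi_square(haphazard, expected)

# Both patterns yield the same statistic, since only squared deviations enter.
print(x2_systematic, x2_haphazard)
```

Because the two sets of observed frequencies differ only in the
arrangement of the signs, the statistic cannot distinguish between them.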

Because of the many shortcomings of chi square, other tests, such as
the Kolmogorov-Smirnov test of fit, will in most cases be preferable.
In some few cases chi square may be desirable because of its additive
property or because of its ability to make allowance for the
identification of parameters in the hypothesized population on the
basis of data whose fit is being tested. However, unless such unique
properties are required, it will be wise to seek another test; and,
when its avoidance is impossible, chi square should be used with
great caution.


SUMMARY 


The  multinomial  test  assumes  random  sampling  of  events  whose 
outcomes  are  independent  and  fall  into  mutually  exclusive  categories, 
the  sum  of  whose  probabilities  is  unity.  The  test  yields  probabilities 
which,  for  a  given  set  of  data,  vary  with  the  system  of  categorization 
used.  The  practical  validity  of  the  test  therefore  depends  upon  de¬ 
fining  and  establishing  categories  which  correspond  precisely  with 
the  situation  to  which  the  experimenter  wishes  to  extend  statistical 
inference.  Although  it  is  an  exact  test,  it  may  require  prohibitively 
extensive  computation,  especially  when  n  is  large,  since  tables  are 
not  available  for  the  case  of  more  than  two  categories. 


The  chi  square  test  is  extensively  tabled  and  was  designed  for 
those  situations  in  which  the  multinomial  test  would  be  appropriate 
if  computation  of  probabilities  were  easier.  The  chi  square  distri¬ 
bution  was,  in  fact,  derived  from  the  multinomial  distribution,  the 
derivation  having  entailed  three  asymptotically  valid  approximations. 

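It is the asymptotic distribution for the statistic
$X^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, where $f_o$ and $f_e$ are the
observed and expected frequencies in a category, which, at finite
sample sizes, is an inexact statistic.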


Because of its relationship to the multinomial, the chi square
test incorporates all of the assumptions on which the multinomial is
based. It therefore assumes that events are randomly sampled,
that possible outcomes, i.e., categories, are mutually exclusive,
that actual outcomes are independent, that $\sum p_i = 1$ and

Further assumptions are required due to steps taken in the
derivation of chi square. Within each multinomial category the
frequency of occurrence is binomially distributed with mean equal
to np. However, the derivation regards this frequency as normally
distributed. The chi square test therefore makes the assumption
that within each chi square "cell" the population of "observed"
frequencies is normally distributed about the expected frequency, np,
as a mean. This is equivalent to assuming infinite n, since it is
only for that case that the binomial can be exactly fitted by a normal
distribution, and since $f_e = np$, it is equivalent to assuming
infinite expected frequencies. Another assumption, traceable to the
derivation, is that all restrictions on the data are both linear and
homogeneous.

Chi square will not be a good approximate test unless the binomial
distribution of observed frequencies within each category is well
approximated by a normal distribution. The normal approximation
worsens with increasingly remote tail positions, with increasing
asymmetry of the binomial and with decreasing sample size. Therefore
the accuracy of the chi square test is a function of α, the
significance level, $p_i$, the probability that a single event will
have an outcome in the i-th category, and n, the total number of
events. The rule that no expected frequency, $f_e = np$, should be
less than 5 is a poor one since the accuracy of the chi square test
varies widely with the individual values of n and p as well as with
their product and since the rule says nothing about α.
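
The dependence on n and p individually, and not merely on their
product, can be sketched numerically. The following example is an
illustration under assumed values: two hypothetical cells both have
expected frequency np = 5, and the exact binomial upper tail is compared
with a continuity-corrected normal approximation. The function names
and the chosen tail point (10 or more occurrences) are assumptions made
for the illustration.

```python
from math import comb, erf, sqrt

def binomial_upper_tail(k, n, p):
    """Exact Pr(X >= k) for X distributed Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

def normal_upper_tail(k, n, p):
    """Normal approximation to Pr(X >= k), with a continuity correction."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    z = (k - 0.5 - mean) / sd
    return 0.5 * (1 - erf(z / sqrt(2)))

# Two hypothetical cells, each with expected frequency np = 5.
cases = [(10, 0.5), (500, 0.01)]
results = {}
for n, p in cases:
    exact = binomial_upper_tail(10, n, p)
    approx = normal_upper_tail(10, n, p)
    # Store exact value, approximation, and their ratio for inspection.
    results[(n, p)] = (exact, approx, approx / exact)
    print(f"n={n}, p={p}: exact={exact:.5f}, normal={approx:.5f}")
```

In this sketch the normal approximation overstates the tail probability
in the symmetric small-n case and understates it in the skewed large-n
case, although both cells satisfy the rule of a minimum expected
frequency of 5.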


The tabled chi square distribution is a continuous one. The value
$X^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, by which "chi square" is
calculated from obtained data, must, however, have a discrete
distribution since observed frequencies are necessarily integers.
This introduces an error which can usually be reduced, but is not
entirely removed, by applying Yates' correction for continuity when
chi square is based upon a single degree of freedom. It should not
be applied, however, if the individual chi squares are to be added to
obtain a total chi square.


If "natural" categories are combined or "pooled" in order to
increase the size of expected frequencies or in order to shorten
computations, the redefinition of categories changes the situation to
which "fit" is being tested. It therefore alters the null hypothesis
in a way which is fairly obvious in the case of contingency tables,
more subtle in the case of tests for goodness of fit (where the null
hypothesis actually being tested is that the various categories have
the expected frequencies assigned to them, not that the two "curves"
are identical). This combining of categories may obscure a real
effect and lead to "acceptance" when the uncombined data actually
call for "rejection" of the hypothesis in which the experimenter
actually is interested, or it may do the opposite. Furthermore, in
tests of goodness of fit to a continuous distribution, not only will
the choice of interval length affect the obtained significance level,
but even the choice of the point at which to begin the leftmost or
rightmost abscissa interval may have a profound effect upon the
significance level obtained. In fact profound effects may attend any
situation in which categories are determined on the basis of a
posteriori expediency rather than by a "natural" discrimination
between precisely those event outcomes in which the experimenter is
interested.
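
How pooling may obscure a real effect can be sketched with hypothetical
frequencies (the data and the function name chi_square are assumptions
chosen only for illustration). Four "natural" categories show large
deviations from expectation, but pooling adjacent pairs makes the
deviations cancel exactly.

```python
def chi_square(observed, expected):
    """Pearson's statistic: sum over cells of (f_o - f_e)^2 / f_e."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Hypothetical observed and expected frequencies in four natural categories.
observed = [5, 25, 25, 5]
expected = [15, 15, 15, 15]
x2_natural = chi_square(observed, expected)      # three degrees of freedom

# Pool categories 1 with 2 and 3 with 4.
pooled_observed = [observed[0] + observed[1], observed[2] + observed[3]]
pooled_expected = [expected[0] + expected[1], expected[2] + expected[3]]
x2_pooled = chi_square(pooled_observed, pooled_expected)  # one degree of freedom

print(x2_natural, x2_pooled)
```

With the natural categories the statistic is 80/3, about 26.67, well
beyond 11.345, the tabled .01 point for three degrees of freedom; after
pooling it is zero, so the pooled data would "fit" perfectly.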

Although  chi  square  is  extremely  complicated  in  its  derivation, 
its  simplicity  of  actual  computational  application  has  made  it  a 
favorite  among  the  statistically  naive.  This  treacherous  combina¬ 
tion  of  theoretical  complexity  and  deceptive  simplicity  in  practical 
application  has  made  it  a  perennially  misused  statistic.  Even 
mathematical  statisticians,  including  those  originating  it  and  modi¬ 
fying  it,  have  experienced  great  difficulty  in  determining  its  proper 
use  and  even  greater  lack  of  success  in  explaining  it  to  lay  statis¬ 
ticians.  Therefore  research  workers  will  be  well  advised  to  check 
thoroughly  into  the  theoretical  admissibility  of  any  contemplated 
application  of  this  statistic.  Those  not  possessing  the  requisite 
sophistication  for  such  an  undertaking  are  urged  to  shun  chi  square.


CHAPTER  IV


EXACT  TREATMENT  OF  FREQUENCY  DATA  IN  FOURFOLD  TABLES 

A test statistic having a "binomial" derivation (but not a binomial
distribution) can be used to test whether or not two samples
dichotomized into A's and B's came from populations with equal A/B
ratios. Tests of this type use only frequency data and are easy to
apply. Depending upon the choice of dichotomous categories, the
method may be used to test for equal A/B ratios, or may be used to
test for location, dispersion, correlation, or trend. The method can
be regarded as an application of Fisher's Method of Randomization
(see next chapter) to observation frequencies rather than their
magnitudes; and, in this context, it is of historical importance in
the development of distribution-free methods.


1. Fisher's Exact Method


a. Rationale. Suppose that two populations, differing perhaps
in many ways, nevertheless each consist entirely of units which belong
to one or the other of two mutually exclusive categories, A and B.
Suppose further that a sample has been drawn from each population and
the experimenter wishes to test the hypothesis that the proportion of
A's in Population I is the same as that in Population II. Letting the
frequency data be represented by the table shown below,

                   Category
                A       B     Total
Sample I        a       b       m
Sample II       c       d       n
Total           r       s       N

if the hypothesis is true one would expect cell frequency a to be such
that, on the average, the proportion, a/m, of A's in Sample I would
equal the proportion, c/n, of A's in Sample II. Therefore one might
reasonably reject the null hypothesis of equal proportions of A's, at
the α level of significance, if the obtained cell frequency a is among
that proportion, α, of possible values of a which cause a/m to differ
from c/n by the greatest amount.

If the validity of the hypothesis be accepted, it follows that the
true proportion of A's among the A's and B's in Population I, in
Population II and in both populations combined, is the same. Let p be
this common, but unknown, proportion. If the null hypothesis is true,
then, the probability of the obtained cell frequencies, within that
set of events in which m units have been drawn from Population I and
n units from Population II, is the product of two binomial
probabilities, being
$\binom{m}{a} p^a (1-p)^b \binom{n}{c} p^c (1-p)^d$ or
$\binom{m}{a} \binom{n}{c} p^r (1-p)^s$. The probability that of the
N units in Samples I and II combined, r will fall in category A and s
in category B is $\binom{N}{r} p^r (1-p)^s$. Therefore the
probability of the obtained cell frequencies within that set of N
events in which m and n units are drawn from the respective
populations, I and II, and r and s units fall into the respective
column categories, A and B, is

$\frac{\binom{m}{a} \binom{n}{c} p^r (1-p)^s}{\binom{N}{r} p^r (1-p)^s}$.

Since the unknown proportion, p, cancels out, the probability of
exactly the obtained cell frequencies with completely specified
marginal frequencies m, n, r and s as shown is

$\frac{m!\, n!\, r!\, s!}{N!\, a!\, b!\, c!\, d!}$.

Since marginal frequencies are constants, this probability can be
expressed in terms of a single cell frequency, becoming

$\frac{m!\, n!\, r!\, s!}{N!\, a!\, (m-a)!\, (r-a)!\, (n-r+a)!}$.

This is the probability for exactly the set of cell frequencies
obtained, i.e., it is a point probability. The probability required,
however, is the cumulative probability for those sets of cell
frequencies which cause the greatest difference between the
proportions a/m and c/n. Therefore the probability

$\frac{m!\, n!\, r!\, s!}{N!\, a!\, (m-a)!\, (r-a)!\, (n-r+a)!}$

must be cumulated over those values of a causing differences between
the proportions a/m and c/n as great as or greater than that existing
in the obtained table. If this cumulated probability is less than α,
the significance level chosen, the null hypothesis is rejected.

b. Null Hypothesis. The proportion of A's in Population I
is the same as the proportion of A's in Population II.

c. Assumptions. (1) Sampling is random, (2) the N units
are independent, i.e., to what categories a unit will belong is
uninfluenced by the categories to which any other unit belongs (This
assumption applies to the generation of the "table" and its marginal
frequencies, and therefore is not in conflict with the fact that the
table is completely specified by its marginal frequencies and a single
cell frequency.), (3) the two row categories are mutually exclusive
as are the two column categories, (4) the "A or B" dichotomy
represents all possible "column" outcomes and the "I or II" dichotomy,
all "row" outcomes (or, alternatively, sampling and statistical
inference are restricted to that set of units capable of being
dichotomized A or B in regard to one measured characteristic and I or
II in regard to another). These assumptions are directly related
to the assumptions of the binomials used in the derivation of the
test. The assumption of independence is also occasioned by the
fact that the probability for the obtained table was obtained by
taking the product of the separate probabilities for the results in
each row.

d.  Efficiency  and  Power.  In  a  sense  the  test  is  perfectly 
"efficient"  since  it  is  an  exact  method  which  uses  all  of  the  "infor¬ 
mation"  in  the  sample;  parametric  tests  for  the  same  problem 
merely  substitute  the  normal  approximation  for  the  true  binomial 
distribution  of  frequencies  within  a  cell  and  therefore  use  the  same 
"information"  but  use  it  somewhat  inaccurately.  In  the  practical, 
computational  sense,  the  test  is  inefficient  for  moderate  and  large 
samples  if  computation  must  be  carried  out  without  the  aid  of  tables. 
Such  tables  do,  however,  exist  for  small  and  moderate  size  samples 
so the test may be regarded as practically inefficient only for
application to large samples.

e. Application. To illustrate the application of this test,
suppose that an experimenter has obtained the frequency data shown
in the table below and wishes to test whether the true survival rate
of persons afflicted by a rare disease is the same for men as for
women.

            Survived    Died    Total
Men             4        10       14
Women           9         1       10
Total          13        11       24

The proportion of men surviving is 4/14 or .2857 while that for
women is 9/10 or .90, and the difference between the two obtained
proportions is .6143. Tables, with the same marginal totals, in
which the difference between the proportions surviving is as great
or greater than .6143 are shown below.

     3   11         13    1         12    2
    10    0          0   10          1    9

The values of a which cause the sex difference in the proportion
surviving to be as great or greater than that in the obtained table
are 3, 4, 12 and 13. Therefore the chance probability for results as
extreme as those obtained, if there actually is no sex-fatality rate
interaction, is

$\sum \frac{m!\, n!\, r!\, s!}{N!\, a!\, (m-a)!\, (r-a)!\, (n-r+a)!}$

with the summation being taken over the values a = 3, 4, 12 and 13 for
a two-tailed test. With m = 14, n = 10, r = 13 and N = 24, each term
equals $\binom{14}{a}\binom{10}{13-a} / \binom{24}{13}$, so that the
four terms are 364, 10010, 910 and 14, each divided by 2,496,144.
This probability is .00453. For the one-sided hypothesis that the
survival rate for men is either greater than or equal to that for
women, the summation is taken over a = 3 and a = 4 which gives a
probability of .00416, i.e., which is "significant" at the .00416
level for a one-tailed test. For the opposite hypothesis that the
survival rate for men is either less than or equal to that for women,
the summation would be taken over the values a = 13, 12, 11, 10,
etc., until the cumulative probability, on the next addition, would
have exceeded the one-tailed significance level. Obviously for
ordinary significance levels, this point would be reached before the
probability for a = 4 was required in the summation, and since the
critical region did not include the actually obtained value, a = 4,
the hypothesis could not be rejected.
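
The cumulation described in this application can be sketched in a few
lines. This is a minimal illustration, not a tabled procedure; the
function names are assumptions introduced here, and the point
probability is written in its binomial-coefficient form, which is
algebraically the same as the factorial expression given in the
Rationale.

```python
from math import comb

def point_probability(a, m, n, r):
    """Probability of the table whose upper-left cell is a, given margins."""
    return comb(m, a) * comb(n, r - a) / comb(m + n, r)

def two_tailed_probability(a_obs, m, n, r):
    """Cumulate over all a with |a/m - c/n| at least as large as observed."""
    observed_diff = abs(a_obs / m - (r - a_obs) / n)
    lowest, highest = max(0, r - n), min(m, r)   # feasible values of a
    total = 0.0
    for a in range(lowest, highest + 1):
        diff = abs(a / m - (r - a) / n)
        if diff >= observed_diff - 1e-12:        # tolerance for rounding
            total += point_probability(a, m, n, r)
    return total

def lower_tail_probability(a_obs, m, n, r):
    """One-tailed probability: cumulate over a <= a_obs."""
    lowest = max(0, r - n)
    return sum(point_probability(a, m, n, r) for a in range(lowest, a_obs + 1))

# Survival example: 14 men (4 survived), 10 women, 13 survivors in all.
m, n, r = 14, 10, 13
p_one = lower_tail_probability(4, m, n, r)
p_two = two_tailed_probability(4, m, n, r)
print(round(p_one, 5), round(p_two, 5))
```

The two-tailed function selects the extreme values of a by the absolute
difference in proportions rather than by doubling one tail, which
matters whenever the two samples are of unequal size.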

f. Discussion. The propriety of Fisher's Exact Method has
been the subject of animated controversy among distinguished
statisticians (2, 5, 14, 17, 29, 30, 38, 45). Some have objected that
a test which necessarily takes marginal totals as fixed is therefore a
"conditional" test and cannot properly be used as a basis for
statistical inference to a larger, unrestricted population. The
principle against which these objections were raised has subsequently
become the basis of a number of distribution-free tests. It is that
if two samples of sizes m and n have been drawn from identical
populations, they may be regarded as a single random sample of size
m+n from the common population. The two original samples may
therefore be regarded as having been obtained by randomly assigning
the label, "Sample I", to m of the m+n units in the "combined" sample,
the n remaining units being labeled "Sample II". The degree to which
Samples I and II differ, in any specified measure not directly
related to size, is therefore a matter of chance. The chance
probability of the observed difference can therefore be obtained by
forming all of the $\binom{m+n}{n}$ different possible "splits" of the
common sample of m+n units into two samples of sizes m and n and by
determining in what proportion of them the specified measure differs
by an amount as great or greater than that actually obtained.

Applying this approach to the fourfold table, the marginal
totals r and s may be regarded as the "parent" sample. There are
$\binom{r+s}{m}$ ways of splitting this sample into two samples of
m and n units. The frequencies a and c can only be obtained
from r and there are $\binom{r}{a}$ ways in which precisely these
frequencies can be obtained for Samples I and II respectively. For
each such way, there are $\binom{s}{b}$ ways of obtaining the
frequencies b and d. Thus the point probability of the obtained
table, given its marginal totals, is

$\frac{\binom{r}{a} \binom{s}{b}}{\binom{r+s}{m}}$ or
$\frac{m!\, n!\, r!\, s!}{N!\, a!\, b!\, c!\, d!}$,

and the cumulative probability is obtained by summing the point
probabilities for the appropriate values of a.
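
The "splits" argument can be checked by brute force on a small case.
The sketch below assumes a hypothetical pooled sample of r = 4 A's and
s = 3 B's and a Sample I of size m = 3; it enumerates every possible
split and compares the resulting proportions with the formula just
given. The variable names are introduced only for this illustration.

```python
from itertools import combinations
from math import comb

# Hypothetical pooled ("parent") sample: r = 4 units of A, s = 3 units of B.
r, s = 4, 3
pooled = ["A"] * r + ["B"] * s
m = 3                      # size of Sample I; the other n = 4 units form Sample II

# Enumerate every way of labeling m of the pooled units "Sample I"
# and count how many A's (the cell frequency a) fall in Sample I.
counts = {}
for sample_one in combinations(range(len(pooled)), m):
    a = sum(1 for i in sample_one if pooled[i] == "A")
    counts[a] = counts.get(a, 0) + 1

total_splits = comb(r + s, m)
enumerated = {a: k / total_splits for a, k in counts.items()}

# Point probabilities from the formula C(r, a) * C(s, b) / C(r + s, m), b = m - a.
formula = {a: comb(r, a) * comb(s, m - a) / total_splits for a in counts}

print(sorted(counts.items()))
```

Enumeration is practical only for small N, since the number of splits
grows very rapidly; the formula gives the same probabilities without
listing the splits.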


A  number  of  different  kinds  of  data  can  logically  be  cast  into 
a  fourfold  table  and  a  variety  of  hypotheses  concerning  the  data  are 
possible.  Furthermore,  the  validity  of  a  given  hypothesis  can  be 
tested  by  a  number  of  methods,  although  perhaps  varying  considerably 
in  efficiency  and  logical  appeal.  These  points  have  been  made  by 
critics  of  the  method  (2,  5,  19,  29).  However,  the  Exact  Method 
appears  to  be  impeccable  when  used  to  test  the  null  hypothesis  that 
the  unknown  proportion  of  A's  in  two  independent  populations,  capable
of  being  dichotomized  into  mutually  exclusive  categories,  A  and  B, 
is  the  same,  and  when  the  sample  sizes  m  and  n  are  determined  in 
advance  of  sampling. 

Unless samples are of equal size, the probability of a will
not be the same as the probability of the "opposite deviation", m-a,
in the upper left cell. Therefore, although when m = n two-tailed
probabilities can be obtained by doubling

$\sum_{x \le a} \frac{m!\, n!\, r!\, s!}{N!\, x!\, (m-x)!\, (r-x)!\, (n-r+x)!}$
or
$\sum_{x \ge a} \frac{m!\, n!\, r!\, s!}{N!\, x!\, (m-x)!\, (r-x)!\, (n-r+x)!}$,

whichever is smaller, when samples are of unequal size the summation
must be taken over those extreme values of a which cause the absolute
difference $|a/m - c/n|$ to be as great or greater than is the case in
the actually obtained table. Furthermore, since a is an integer its
probability is discretely distributed and there is not likely to be
correspondence between integral values of a and the standard
significance levels, .05, .01, and .001. If the experimenter has
some compelling reason for wishing to use these standard significance
levels he may employ Tocher's (38) modification: If the obtained
cumulative probability is less than the standard significance level,
α, results are considered significant at the level α. If the obtained
cumulative probability exceeds α but would be less than α if cumulated
for a value of a one unit more extreme ("more extreme" values
being understood to be those causing a larger absolute difference
$|a/m - c/n|$), Tocher computes the ratio


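$\frac{\alpha - \Pr(\text{more extreme } a\text{'s})}{\Pr(\text{observed } a)}$,

in which the denominator is the point probability of the observed a,
so that the overall probability of rejection under the null hypothesis
is exactly α. He then enters a table of random numbers running from
0 to 1 and randomly selects a number. If the number selected is
smaller than the above ratio, results are considered to have fallen
within the α level of significance and the null hypothesis is
rejected.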
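
A minimal sketch of a randomized decision of this kind follows. It
rests on a stated assumption: the denominator of the ratio is taken as
the point probability of the observed a (together with any value of a
giving the same absolute difference), which is the choice under which
the overall rejection probability equals α exactly. The function
names and the rng argument are hypothetical, introduced only for the
illustration.

```python
import random
from math import comb

def point_probability(a, m, n, r):
    """Probability of the table whose upper-left cell is a, given margins."""
    return comb(m, a) * comb(n, r - a) / comb(m + n, r)

def tocher_decision(a_obs, m, n, r, alpha, rng=random.random):
    """Return (reject, ratio) for a randomized two-tailed exact test.

    Assumption: ratio = (alpha - Pr(more extreme)) / Pr(observed).
    """
    observed_diff = abs(a_obs / m - (r - a_obs) / n)
    p_more = 0.0   # values of a giving a strictly larger difference
    p_obs = 0.0    # values of a giving the same difference as observed
    for a in range(max(0, r - n), min(m, r) + 1):
        diff = abs(a / m - (r - a) / n)
        if diff > observed_diff + 1e-12:
            p_more += point_probability(a, m, n, r)
        elif abs(diff - observed_diff) <= 1e-12:
            p_obs += point_probability(a, m, n, r)
    if p_more + p_obs <= alpha:      # significant without randomization
        return True, 1.0
    if p_more >= alpha:              # not significant even one unit out
        return False, 0.0
    ratio = (alpha - p_more) / p_obs
    return rng() < ratio, ratio      # reject if the random number is smaller

# Survival table (a = 4, m = 14, n = 10, r = 13) at an assumed alpha of .001.
reject_low, ratio = tocher_decision(4, 14, 10, 13, 0.001, rng=lambda: 0.05)
reject_high, _ = tocher_decision(4, 14, 10, 13, 0.001, rng=lambda: 0.50)
print(round(ratio, 4), reject_low, reject_high)
```

The fixed rng arguments stand in for numbers drawn from a random number
table; in actual use the default random source would be left in place.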


g. Tables. A number of tables (12, 13, 20, 21, 22, 23, 24, 42)
have been prepared expressly for use with Fisher's Exact Method.
Some have used Fisher's Exact Method to calculate probabilities when
N is small, but have resorted to chi square with Yates' correction
when N exceeds a certain value. In some of the tables it is suggested
that two-tailed probabilities can be obtained by doubling the
one-tailed probability listed in the table. This, of course, is
strictly legitimate only if the distribution of the test statistic is
symmetrical which, in fact, is the case only when the two samples are
of equal size.


When N is small or when the significance level is extreme,
probabilities may be obtained by a method described by Mosteller
(II-34). The point probability of a is

$\frac{\left[\binom{m}{a} p^a (1-p)^{m-a}\right] \left[\binom{n}{c} p^c (1-p)^{n-c}\right]}{\left[\binom{N}{r} p^r (1-p)^{N-r}\right]}$.

Each of the bracketed expressions is a binomial probability, and
since the terms involving p cancel out, p may be arbitrarily assigned
any constant value and the bracketed probabilities can then be
obtained from tables of the point binomial. Thus the point
probabilities of the most extreme values of a can be calculated and
then cumulated.
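
That the arbitrary value assigned to p does not affect the result can
be verified with a short sketch. The function names are assumptions
made for the illustration, and the binomial point probabilities are
computed directly rather than read from tables.

```python
from math import comb

def binomial_point(k, size, p):
    """Point binomial probability of k successes in `size` trials."""
    return comb(size, k) * p**k * (1 - p)**(size - k)

def ratio_of_binomials(a, m, n, r, p):
    """Point probability of a as a ratio of three binomial probabilities."""
    c = r - a
    numerator = binomial_point(a, m, p) * binomial_point(c, n, p)
    return numerator / binomial_point(r, m + n, p)

# Survival table from the Application: a = 4, m = 14, n = 10, r = 13.
values = [ratio_of_binomials(4, 14, 10, 13, p) for p in (0.2, 0.5, 0.8)]

# Direct value from the exact formula, for comparison.
exact = comb(14, 4) * comb(10, 9) / comb(24, 13)
print(values, exact)
```

Any value of p strictly between 0 and 1 serves equally well; a value
such as .5 is convenient only because binomial tables for it are the
most widely available.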


h. Sources. 2, 5, 9, 12, 13, 14, 15, 17, 18, 20, 21, 22, 23,
24, 29, 38, 42, 44, 45, 46, 47. See also: 1, 3, 6, 16, 19, 28, 30,
31, 32, 33, 34, 36, 39, 48.


2. Westenberg's Median Test


a. Rationale. Let two samples of measurements be taken,
one from Population I, the other from Population II, and let M be the
median measurement of the pooled samples. If a and c are the
respective numbers of measurements in Samples I and II which exceed
M and if b and d are the corresponding numbers of measurements which
are less than M, the data can be arranged in a fourfold table as
follows and Fisher's Exact Method can be used to test whether the
proportion of measurements in Population I which exceed M is the same
as the proportion of measurements greater than M in Population II.

              Above M    Below M    Total
Sample I         a          b         m
Sample II        c          d         n
Total           N/2        N/2        N

If the pooled sample median M be regarded as an estimate of the
pooled population median, the test can be used to test the hypothesis
that Populations I and II have identical medians. Otherwise it simply
tests whether the value M splits Populations I and II into the same,
but unknown, proportions.

b.  Null  Hypothesis.  The  proportion  of  measurements  which 
lie  above  the  median  of  Samples  I  and  II  combined  is  the  same  for 
Population  I  as  for  Population  II. 

A sufficient, but not a necessary, condition for the validity
of the null hypothesis is that Populations I and II be identical.
Therefore rejection of the null hypothesis implies rejection of the
hypothesis of identical populations, but failure to reject the null
hypothesis is not equivalent to failure to reject the hypothesis of
identical populations.

c. Assumptions. As does Fisher's Exact Method, the test
assumes random sampling, dichotomized and mutually exclusive
categories for both rows and columns, and assumes that, in the
process of sampling, each measurement value is independent of the
value of every other measure (even though measurements are not
independent in their a posteriori categorization). The median test
assumes further that both populations are continuously distributed
so that no measurements will be tied with M.

d. Treatment of Ties. Tied scores are a problem only when
tied with M. If such ties constitute only a small proportion of N,
half of the scores in each sample which are tied with M may be
categorized as "above M", half as "below M". If, in a given sample,
there are an odd number of such ties, the odd tie may be discarded
and the sample size reduced by one, or the odd tie may be categorized
in whichever way will be least conducive to rejection of the null
hypothesis. For a more conservative test, all scores tied with M may
be categorized in the manner least conducive to rejection.

e. Efficiency. The asymptotic efficiency of the median test
for location relative to Student's t-test, when both tests are applied
to normal populations with equal variances, was found by Mood (V-37)
to be 2/π or .637. Mood qualified his findings as resting upon
certain unproved assumptions. Dixon (XI-13) found the power
efficiency of the test, when sample sizes are very small, to be
inferior to that of the Wilcoxon test and to that of the Maximum
Absolute Deviation test when all three tests were applied to test the
difference in means of two samples drawn from normal populations with
equal variances. Lehmann (I-31) examined the relative power of six
nonparametric tests when based on two small samples of equal size from
two quite different continuous distributions. Ranked in order of
decreasing power the tests were: Lehmann's "Most Powerful" test for
the specific situation tested (one-tailed test), the Mann-Whitney test
(one-tailed), Westenberg's Median test (one-tailed), the Mann-Whitney
test (two-tailed), Westenberg's Median test (two-tailed), and finally
the Wald-Wolfowitz Total Number of Runs test. Roughly, the median
test was about 75% as powerful as the Mann-Whitney test. Apparently
on the basis of these and his own results, Van der Waerden (I-52)
concludes that the median test generally is less powerful than his X
test.


f. Application. Fix sample size in advance and draw a sample
from each of the two populations. Find the median, M, of the two
samples when pooled, then determine the number of scores, a, in
Sample I which are above, and the number, b, which are below M,
counting half of the scores tied with M as "above", half as "below",
and discarding any odd tie. Find the corresponding numbers, c and d,
for Sample II, then construct the frequency table shown in
"Rationale" with m = a+b, n = c+d and N = m+n. Under this procedure,
the frequency data entered in the fourfold table do not include the
median score M, and the cell and marginal frequencies do not
represent any discarded odd ties. From this point on, application is
the same as for Fisher's Exact Method.
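
The steps of this application can be sketched as follows. The data
are hypothetical and the function names are assumptions made for the
illustration; ties with M are split half "above", half "below" within
each sample, any odd tie being discarded, and the two-tailed
probability is then cumulated as in Fisher's Exact Method.

```python
from math import comb
from statistics import median

def fourfold_from_median(sample_one, sample_two):
    """Return (M, (a, b, c, d)) for the pooled-median dichotomy."""
    M = median(sample_one + sample_two)

    def split(sample):
        above = sum(1 for x in sample if x > M)
        below = sum(1 for x in sample if x < M)
        ties = sum(1 for x in sample if x == M)
        # Half of the ties go above, half below; an odd tie is discarded.
        return above + ties // 2, below + ties // 2

    a, b = split(sample_one)
    c, d = split(sample_two)
    return M, (a, b, c, d)

def fisher_two_tailed(a, b, c, d):
    """Cumulate point probabilities over differences as extreme as observed."""
    m, n, r = a + b, c + d, a + c
    observed_diff = abs(a / m - c / n)
    total = 0.0
    for x in range(max(0, r - n), min(m, r) + 1):
        if abs(x / m - (r - x) / n) >= observed_diff - 1e-12:
            total += comb(m, x) * comb(n, r - x) / comb(m + n, r)
    return total

# Hypothetical measurements from two populations.
sample_one = [12, 15, 17, 18, 21, 24]
sample_two = [3, 5, 8, 9, 11, 14]

M, table = fourfold_from_median(sample_one, sample_two)
probability = fisher_two_tailed(*table)
print(M, table, round(probability, 4))
```

Discarding the odd tie is only one of the options described under
"Treatment of Ties"; the more conservative alternatives would require
a different split function.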




g.  Discussion.  The  hypothesis  actually  tested  is  that  equal 
proportions  of  Populations  I  and  II  lie  above,  and  equal  proportions 
lie  below,  the  pooled  sample  median.  K  the  pooled  sample  median, 

M,  were  the  same  value  as  the  median  of  the  pooled  populations,  the 
test  would  test  whether  or  not  Populations  I  and  II  had  identical  medi¬ 
ans.  However,  this  is  almost  certain  not  to  be  the  case.  When  N 
is  small  the  pooled  sample  median  and  the  pooled  population  median 
may  differ  quite  appreciably;  for  large  values  of  N,  however,  the 
difference  can  be  expected  to  be  relatively  small.  Phrased  differ¬ 
ently,  the  median  test  tests  whether  or  not  the  value,  M,  represents 
the  same,  but  unknown,  quantile  in  the  two  populations.  If  the  null 
hypothesis is true and N is large this unknown quantile will be a
proportion very close to .5 and the score M will be very nearly the
common median of the two populations. The median test can, in this case,
be regarded in an approximate sense as a test for identical population
medians. However, when N is small, the validity of the null hypothesis
does not insure that the unknown quantile represented by M will be in
the neighborhood of .5, and the test can only be considered as testing
whether  the  distributions  of  the  two  populations,  when  cumulated  up 
to  the  point  M,  contain  equal  areas.  If  the  two  populations  are  iden¬ 
tical  this  will  be  the  case,  so  the  small-sample  median  test  can  be 
used  to  test  the  hypothesis  of  identical  population  distributions. 

(See  "Null  Hypothesis"). 


If  N  is  an  odd  number,  the  pooled  sample  median  has  the  same 
value  as  one  of  the  obtained  scores.  Since  this  score  is  neither  above 
nor  below  M,  it  represents  a  third  "binomial"  outcome  and  violates  one 
of the assumptions on which the test is based. (If N is large the
consequence of this violation will be slight.) If N is even, this problem does
not arise. However, in this case, M does not have a specific value, but
rather can be defined only as lying somewhere between the (N/2)th and the
(N/2 + 1)th ranked scores. Thus the null hypothesis, that equal proportions
of the two populations lie above M, becomes equally vague. Summarizing,
then, the median test is an approximate test for identical but unknown
quantiles. As sample size increases, it becomes more nearly exact and
the unknown quantile approaches .50 so that it tends to become a test for
equal population medians.


h. Tables. All tables for Fisher's Exact Method are appropriate.
(See 1. Fisher's Exact Method, g). Tables especially designed
for the median test have been published by Westenberg (40, 41, 43).

i. Sources. 26, 40, 41, 43.


3. The Median Test for Linear Trend


Cox and Stuart (11) have pointed out that if Sample I is taken
to be the first half, and Sample II the second half, of a series of
observations taken sequentially, the median test can be used to test for
linear trend. If as time passes the population distribution, without
changing in shape, simply "slides" upward or slides downward
unidirectionally on the "x-axis", then the proportion of values above M
in Population I will not be the same as the corresponding proportion
in Population II. (Here Population I is the temporally changing
population considered as existing from the beginning of sampling until
half of the observations have been taken, Population II being similarly
defined for the remaining interval.) And this statement will be
equally valid whatever quantile M represents when the null hypothesis
is true. Therefore, if it can be legitimately assumed that the
sampled population may change in the location but not in the shape of its
distribution, the test will be sensitive to "slippage" of any location
parameter, and the question of how closely M represents the common
population median will not be a problem.

Generally,  however,  a  change  in  location  is  accompanied  by 
a  change  in  dispersion,  and  therefore  by  a  change  in  the  form  of  the 
population  distribution.  Therefore,  in  the  generality  of  cases  the 
additional assumption will not be legitimate. In such cases if the
null hypothesis is false, the true, i.e., "alternative", hypothesis is
that M is a different quantile in Population II than in Population I,
i.e., the cumulative distributions of Populations I and II have different
ordinates at the abscissa point M. If the additional assumption
cannot be made, then, the test, in effect, tests for shift in an unknown
quantile which may be near to or far from the population median.

The asymptotic relative efficiency of the median test for
trend, relative to "the best (parametric) test against normal
regression, based on the sample regression coefficient, b," is .78
(11, 35). This is the same as the A.R.E. of Cox and Stuart's
S2 sign test for trend.


4. Westenberg's Test for Interquartile Range


Westenberg (43) has proposed a modification of his own
median test in which, instead of dividing each sample into
observations above and observations below the median of the pooled sample,
the samples are divided into observations within and observations
outside of the interquartile range, Q1 to Q3, of the pooled sample.


                 Within       Outside
                 Q1 - Q3      Q1 - Q3

Sample I            a            b           m

Sample II           c            d           n

Total              N/2          N/2          N

Since  the  expected  proportion  of  observations  above  a  median  is  the 
same  as  the  expected  proportion  of  observations  within  an  interquartile 
range, the two tests have identical mathematical bases. The
performance of the interquartile range test is therefore analogous to that of the
median test. The null hypothesis is that identical proportions of
Populations I and II lie within the interquartile range of the pooled samples.
The test therefore does not test whether the two populations have equal
interquartile ranges; it tests whether they have equal areas included
between the values Q1 and Q3 which were obtained from the samples.
(See "Discussion" of the median test.) The efficiency of the test
apparently is unknown. Treatment of ties is analogous to that of
the median test; all ties may be categorized conservatively; or in
each sample, half of the observations tied with either Q1 or Q3 may be
counted as "within", half as "outside" and any odd tied observation
discarded.


5. A "Median" Test for Correlation

a.  Rationale.  Consider  a  sample  of  units  or  individuals 
upon  each  of  which  an  x  measurement  and  a  y  measurement  have 
been  made.  Let  its  scattergram  be  divided  into  four  quadrants  by 
a horizontal line through the sample's y median and a vertical line
through its x median. Then if the x and y attributes are uncorrelated,
one would expect each of the four quadrants to contain about the
same number of units; while, if a correlation exists, a preponderance
of units should be located in one of the two pairs of diagonal quadrants.

If the x and y attributes are uncorrelated, dividing the original
sample into two equal sized samples on the basis of some
characteristic of x will divide the y's into two "y-samples" which differ
on the basis of chance alone. They are therefore two samples from
the same population of y's, and in each sample the proportion of y's
having any specified y characteristic should also differ on the basis
of chance alone. On the other hand if x and y are correlated in
respect to the criteria used to subdivide the sample, the two
y-samples will, in a sense, be from different populations which contain
different proportions of y's with the relevant, specified
characteristic.


This  treatment  of  correlation  reduces  therefore  to  Fisher's 
Exact  Method  with  categories  as  shown  below: 


                                          A:            B:
                                          Above         Below
                                          sample        sample
                                          y median      y median      Total

Sample I:  y's whose paired x is
           above sample x median            a             b             m

Sample II: y's whose paired x is
           below sample x median            c             d             n

Total                                       r             s             N

The categorizations and designation of table frequencies can be
simplified to the following, "units" being the item tabled, a unit's x measure
being referred to in the rows, its y measure in the columns.


                    Above         Below
                    y median      y median      Total

Above x median         a          N/2 - a        N/2

Below x median      N/2 - a          a           N/2

Total                 N/2           N/2           N

The point probability for the tabled frequencies, if the null hypothesis
of no correlation is true, is therefore

$$\frac{[(N/2)!]^4}{N!\,(a!)^2\,[(N/2 - a)!]^2}.$$

b. Null Hypothesis. In the parent population, those units
whose x value exceeds the sample x-median have the same proportion
of y's above the sample y-median as have those units whose x
value is less than the sample x-median. A sufficient condition for
its validity is that the x and y attributes are uncorrelated.

c. Assumptions. Same as for Westenberg's median test;
see 2.


d. Treatment of Ties. Tied scores are a problem only when
tied with one or both of the sample medians. For a conservative test
all such ties may be categorized in the manner least conducive to
rejection of the null hypothesis. Alternatively, to minimize tie error,
half of the scattergram units lying on the line separating two
quadrants may be counted as belonging to each quadrant. If there is
an odd number of such units, the odd unit should be held for
discarding. Units lying on the intersection of the two median lines should
be discarded. Before discarding, a certain number of "units" may
be salvaged. For example, if one unit has its x value tied with
the x median and another unit has its y value tied with the y median,
two new "units" may be formed from the old ones, one of which has
nontied x and y values, the other having both x and y values tied
with their medians. Only the latter new unit need be discarded,
the former being "returned" to the sample. The value N should
refer to the number of units remaining in the sample after all
discarding has been completed. When ties are treated in this manner,
marginal frequencies need not all equal N/2 so the formula

$$\frac{m!\, n!\, r!\, s!}{N!\, a!\, b!\, c!\, d!}$$

from Fisher's Exact Method should be used to calculate probabilities.

e. Efficiency. Applied to populations known to have normally
distributed x's and normally distributed y's, the test has an asymptotic
local efficiency of (2/π)^2 or .41 relative to the correlation coefficient
ρ. Under the same circumstances its asymptotic efficiency relative
to Kendall's rank order correlation coefficient, τ, is 4/9 (4).


f. Application. Find the sample x and y medians, construct
the fourfold table shown in "Rationale" treating ties as outlined under
(d), and apply Fisher's Exact Method.
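As a rough illustration of this procedure, the sketch below builds the fourfold table from paired measurements and sums the fixed-margin probabilities of all tables at least as extreme in the direction of positive correlation. The function names are hypothetical, and, as a deliberate simplification, any unit tied with either sample median is simply dropped rather than salvaged as outlined under (d).

```python
from fractions import Fraction
from math import comb
from statistics import median


def quadrant_table(xs, ys):
    """Cell frequencies (a, b, c, d): rows are above/below the x median,
    columns are above/below the y median."""
    mx, my = median(xs), median(ys)
    a = b = c = d = 0
    for x, y in zip(xs, ys):
        if x == mx or y == my:
            continue  # simplification: tied units are dropped, not salvaged
        if x > mx:
            if y > my:
                a += 1
            else:
                b += 1
        else:
            if y > my:
                c += 1
            else:
                d += 1
    return a, b, c, d


def upper_tail(a, b, c, d):
    """Probability, with all margins fixed, of a table whose upper-left
    cell is at least as large as the observed a."""
    m, n, r = a + b, c + d, a + c
    ways = sum(comb(m, k) * comb(n, r - k) for k in range(a, min(m, r) + 1))
    return Fraction(ways, comb(m + n, r))


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 1, 4, 3, 6, 5, 8, 7]
table = quadrant_table(xs, ys)
print(table, upper_tail(*table))
```

The invented data place every unit in the "above, above" or "below, below" quadrant, so the observed table is the single most extreme one for N = 8.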


g. Discussion. If no correlation exists, then, on the average,
half of the sample units should fall in the "North-West" and
"South-East" quadrants and half in the opposite diagonal pair. It might be
supposed, therefore, that the number of units, r, in one of the pairs
of diagonal quadrants would be binomially distributed with p = .50
when the null hypothesis is true, so that $\binom{N}{r}(.50)^N$ would be the
point probability of the obtained results. Such a supposition would be
in  error.  The  binomial  test  would  require  that  the  categorization  of 
each  unit  to  one  of  the  diagonal  pairs  of  quadrants  be  independent  of 
the  categorization  of  every  other  unit.  However,  it  is  in  the  nature 
of the construction of the table that equal numbers of units must "fall"
in diagonally opposite quadrants. Thus, for each unit falling in a
given quadrant, another unit must fall in the diagonally opposite
quadrant and therefore must receive the same binomial categorization
given the first unit. For example, if N = 4, there are three possible
tables:

      2  0         1  1         0  2
      0  2   ,     1  1   ,     2  0

There are 6 permutations of the
N units which will give the first table, 24 which will yield the second,
and 6 which result in the third. Thus the respective probabilities of the
three tables are 6/36, 24/36 and 6/36 or 1/6, 4/6 and 1/6. These are
also the probabilities obtained by using the formula

$$\frac{[(N/2)!]^4}{N!\,(a!)^2\,[(N/2 - a)!]^2}.$$

If the binomial test is applied to the three tables, the respective
probabilities are calculated to be 1/16, 6/16, and 1/16. Not only are these
"probabilities" different, but their sum is 1/2 rather than 1, clearly indicating
that the test is fundamentally in error. The sum of the "probabilities"
is 1/2 rather than 1 because the number of units in a pair of diagonally
opposite quadrants can only be an even number, while a truly binomial
variate can assume any integral value between zero and N. Nor would
it be correct to confine the binomial test to, say, the upper two
quadrants, calculating the probability that, of the N/2 units in the upper
two quadrants, a of them would fall in the left quadrant. In the
upper half of the tables just discussed, the number of permutations
of the N/2 units which will give the three results shown are 1, 2,
and 1. The probabilities for the upper halves of the three tables,
considered separately and as if independent of the lower halves,
are therefore 1/4, 2/4, and 1/4, which are also those obtained by
using the binomial formula. Thus the table, taken as a whole, has
a different probability than its upper half alone. Clearly, then, the
dependence between units in diagonally opposite quadrants is
a partial dependence which can neither be ignored, by applying a
binomial test to the number of units in a diagonal pair of quadrants, nor
be treated as a complete dependence by confining the binomial test to
the upper half of the table. The error shown to exist in the binomial
approach is not confined to very small sample sizes. For example,
the table

      0  8
      8  0

has the "probabilities" 1/12,870, 1/65,536, and 1/256 respectively
when tested by Fisher's Exact Method, by the
binomial test applied to the entire table, and by the binomial test
applied only to the upper half of the table.
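The figures quoted in this discussion can be checked numerically. The short sketch below (function names are hypothetical) evaluates the exact formula and the two mistaken binomial calculations for N = 4 and for the N = 16 table, using exact fractions.

```python
from fractions import Fraction
from math import comb, factorial


def exact_probability(N, a):
    """[(N/2)!]^4 / (N! (a!)^2 [(N/2 - a)!]^2), for even N."""
    h = N // 2
    return Fraction(factorial(h) ** 4,
                    factorial(N) * factorial(a) ** 2 * factorial(h - a) ** 2)


def binomial_whole_table(N, r):
    """Mistaken approach: r units in one diagonal pair treated as binomial(N, .5)."""
    return Fraction(comb(N, r), 2 ** N)


def binomial_upper_half(N, a):
    """Mistaken approach: a of the N/2 upper units treated as binomial(N/2, .5)."""
    h = N // 2
    return Fraction(comb(h, a), 2 ** h)


# N = 4: the three tables have a = 2, 1, 0, so the diagonal pair holds 2a units.
exact = [exact_probability(4, a) for a in (2, 1, 0)]
whole = [binomial_whole_table(4, 2 * a) for a in (2, 1, 0)]
upper = [binomial_upper_half(4, a) for a in (2, 1, 0)]
print(exact, sum(exact))
print(whole, sum(whole))
print(upper, sum(upper))

# N = 16 with a = 0 (all units in the other diagonal pair).
print(exact_probability(16, 0), binomial_whole_table(16, 0), binomial_upper_half(16, 0))
```

Only the exact probabilities sum to 1 over the tables that can actually occur; the whole-table binomial values sum to 1/2, as the text points out.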


If  the  sample  is  divided  into  quadrants  by  its  x  and  y  means, 
rather  than  medians,  the  "binomial"  approach  is  still  unconscionable. 
If  median  and  mean  are  identical  all  of  the  objections  discussed  above 
apply.  If  they  differ,  the  premise  that  half  of  the  sample  units  would 
be  expected  to  lie  in  a  pair  of  diagonally  opposite  quadrants  is  false, 
and the binomial parameter, p, does not have the value, .50,
substituted in the formula used to calculate probabilities.

h. Tables. Tables for Fisher's Exact Method are appropriate.
See 1.


i. Sources. 4, 8, 10.


6. Test for a Difference between Correlated Proportions

a. Rationale. If each of N units or individuals has been
categorized as belonging to one or the other of two mutually exclusive
categories I and II, and the same N units have been categorized according
to another mutually exclusive dichotomy A and B, the experimenter
may wish to know whether or not in the parent population the
proportion of I's differs from the proportion of A's. Let the frequency
data be represented by the accompanying table.


             A        B

   I         a        b        m

   II        c        d        n

             r        s        N

Letting primes indicate population values corresponding to sample
frequencies, the proportion of I's is m'/N' or (a' + b')/N' and the
proportion of A's is r'/N' or (a' + c')/N'. These two proportions are equal
only if b' = c'. Therefore, the hypothesis of equal proportions can be
tested by examining the probability of obtaining the sample b and c by
random sampling of b+c units from an infinite population consisting of
equal numbers of "b's" and "c's". Thus the point probability for the
obtained b and c is given by the binomial $\binom{b+c}{b}(.5)^{b+c}$.

b. Null Hypothesis. In an infinite population of units each of
which is classed as either I or II and as either A or B, the proportion
of units categorized as I's has the same value as the proportion of units
categorized as A's. If this hypothesis is true, it follows inevitably
that there are exactly as many II A units as I B units in the population
of I A's, II A's, I B's and II B's, and this is the hypothesis actually
tested.

c. Assumptions. Since the test is a binomial one, it depends
upon the usual binomial assumptions: (1) sampling is random,
(2) categorization of one unit does not influence the categorization of any other
unit, i.e., units are independent and are drawn from an infinite
population of potential units, (3) the population selected for test, i.e., the
units categorized II A or I B, constitutes a dichotomy, (4) the
dichotomized categories II A and I B are mutually exclusive. In addition,
the unique construction of the test necessitates the following
assumptions: (5) the I and II categories are mutually exclusive as are the
A and B categories, thereby making the four categories I A, II A,
I B, and II B mutually exclusive (the latter is required in order that
the "no trial" categories, I A and II B, will contain none of the II A
or I B attributes, those actually tested, thus making the exclusion
of I A and II B data legitimate), (6) every unit categorized either
I or II is also categorized either A or B and vice versa, i.e.,
the "I II" and "A B" categorizations are applied to the same data;
unless this is the case, the data cannot legitimately be cast into
a fourfold table, but specifically the proportions of I's and A's
cannot be represented as (a' + b')/N' and (a' + c')/N' respectively, and a
difference between b' and c' is not sufficient to demonstrate a difference
between the two proportions.


d. Efficiency. No information seems to be available; however,
it would appear logical that the test efficiency would be high since the
test appears to make efficient use of all the "information" available.

e. Application. Draw a sample of N units from the population
in question, and let the table shown in "Rationale" represent the
frequency data categorized according to each of the dichotomies I or II
and A or B. Let α represent the level of significance chosen, and
let r represent the smaller of the two frequencies, b and c.

For a two-tailed test of the null hypothesis that in the parent
population the unknown proportion of I's is the same as the correlated,
unknown proportion of A's, reject if
$2\sum_{i=0}^{r}\binom{b+c}{i}(.5)^{b+c} \le \alpha$. For a
one-tailed test, reject the hypothesis that the proportion of I's is either
the same or smaller than the proportion of A's if
$\sum_{i=0}^{c}\binom{b+c}{i}(.5)^{b+c} \le \alpha$. Or, for the opposite
one-tailed test, reject the
hypothesis that the proportion of I's is either the same or greater than
the proportion of A's if
$\sum_{i=0}^{b}\binom{b+c}{i}(.5)^{b+c} \le \alpha$.
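The three rejection rules can be sketched as follows. The function and key names are hypothetical, and capping the doubled tail at 1 is an added convenience for the case b = c rather than part of the rule as stated.

```python
from fractions import Fraction
from math import comb


def lower_tail(k, n):
    """Sum of C(n, i) (.5)^n for i = 0, ..., k, as an exact fraction."""
    return Fraction(sum(comb(n, i) for i in range(k + 1)), 2 ** n)


def correlated_proportions_test(b, c):
    """Probabilities to compare with the chosen significance level.
    Only the b and c cells of the fourfold table enter the test."""
    n = b + c
    r = min(b, c)
    return {
        # two-tailed: twice the tail up to the smaller of b and c (capped at 1)
        "two_tailed": min(Fraction(1), 2 * lower_tail(r, n)),
        # one-tailed, evidence that the proportion of I's exceeds that of A's
        "I_greater_than_A": lower_tail(c, n),
        # opposite one-tailed test
        "I_less_than_A": lower_tail(b, n),
    }


result = correlated_proportions_test(b=8, c=2)
print(result)
```

With the invented frequencies b = 8 and c = 2, the two-tailed probability is 7/64, so the null hypothesis would not be rejected at the .05 level.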


f. Discussion. McNemar (25), who originated the test, used
the chi square approximation, rather than the binomial, with

$$\chi^2 = \frac{\left(b - \frac{b+c}{2}\right)^2}{\frac{b+c}{2}} + \frac{\left(c - \frac{b+c}{2}\right)^2}{\frac{b+c}{2}}$$

which reduces to

$$\chi^2 = \frac{(b-c)^2}{b+c}$$

with one degree of freedom. The binomial, however, is the exact test
and should be used unless b+c is very large, in which case either test
may be used.
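
The reduction of the chi square expression can be verified in a line or two. Writing e = (b+c)/2 for the frequency expected in each of the two cells (the symbol e is introduced here only for this illustration), the two deviations are equal in magnitude and opposite in sign:

```latex
\begin{aligned}
b - e &= \tfrac{b-c}{2}, \qquad c - e = -\tfrac{b-c}{2},\\[4pt]
\chi^2 &= \frac{(b-e)^2}{e} + \frac{(c-e)^2}{e}
        = \frac{2\left(\tfrac{b-c}{2}\right)^2}{\tfrac{b+c}{2}}
        = \frac{(b-c)^2}{b+c}.
\end{aligned}
```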

Although this test bears a superficial similarity to the
binomial test for correlation criticised in the "Discussion" section of
(5. A "Median" Test for Correlation), the objections voiced there
do not apply here. In the present test, categories are completely
specified in advance of sampling, the categorization of one unit does
not influence the categorization of any other unit, and the
"population" from which the sample is considered to have been obtained is
the parent population from which the b+c units were drawn. In the
binomial  test  for  correlation,  on  the  other  hand,  categories  were 
established  after  sampling  and  were  a  function  of  the  sample  results, 
and  the  categorizations  of  units  were  not  independent.  The  proper 
analysis of such data requires that the test be a "conditional" test in
which the obtained table is regarded as a sample from a population of
tables. Each table in a population of tables with fixed marginal
frequencies is a different permutation of the units constituting the cell
frequencies. Therefore, in calculating probabilities for such
conditional tests all permutations and therefore all cells must be
considered. Since McNemar's test is not a conditional test, no restrictions
having been placed on marginal frequencies, units may distribute
themselves in the "b" and "c" cells strictly according to the binomial
law.


g. Tables. Use tables of the cumulative binomial probability
with p = .5, or tables for the Sign Test. (See Chapter II.) Tables
especially designed for the application of this test employing the chi
square approximation to the binomial have been published by
Swineford (37).

h. Sources. 25, 37.


CHAPTER V


TESTS BASED ON FISHER'S METHOD OF RANDOMIZATION I


The logical basis for most distribution-free tests is rooted in a
method originated by R. A. Fisher and known as the Method of
Randomization. The basis of statistical inference is simply this. If
several samples have been drawn from a common population, they may
be  regarded  as  one  large  sample  whose  observations  have  been  ran¬ 
domly  assigned  to  subsamples  or  component  samples  of  the  sizes 
actually  drawn.  Each  of  the  different  possible  random  assignments 
was,  prior  to  sampling,  equally  likely  to  be  the  actually  obtained 
sample,  if  the  null  hypothesis  of  identical  populations  is  true,  but 
unequally  likely  to  be  if  the  null  hypothesis  is  false.  By  choosing  a 
test  statistic  which  is  sensitive  to  the  alternative  hypothesis  and  cal¬ 
culating  its  value  for  each  of  the  n  different  possible  random  assign¬ 
ments,  one  obtains  a  set  of  n  equally  weighted  values  of  the  test 
statistic  (some  of  which  are  the  same)  which  form  the  distribution 
of  the  test  statistic  under  the  null  hypothesis.  Its  rejection  region 
is  simply  the  N  most  extreme  of  these  values  each  of  which  is  exactly 
as  likely  as  any  other  value  when  the  null  hypothesis  is  true,  but  which 
become  especially  probable  when  the  alternative  hypothesis  is  true. 

If  the  test  statistic  for  the  actually  obtained  sample  falls  within  the 
rejection  region,  the  null  hypothesis  can  be  rejected  at  the  N/n  level 
of  significance. 
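
For two samples, the scheme just described can be sketched as below. This is an illustrative sketch, not a reconstruction of any published routine: the function name is hypothetical, the test is one-tailed in the upward direction, and the sum of the first sample is used as the test statistic because, with sample sizes fixed, it orders the assignments exactly as the difference between sample means does.

```python
from fractions import Fraction
from itertools import combinations


def randomization_probability(sample1, sample2):
    """Proportion N/n of the n equally likely assignments whose statistic
    is at least as large as the one actually obtained."""
    pooled = sample1 + sample2
    observed = sum(sample1)
    n_assignments = 0   # n in the text: all possible random assignments
    n_extreme = 0       # N in the text: assignments as extreme as the obtained one
    for chosen in combinations(range(len(pooled)), len(sample1)):
        n_assignments += 1
        if sum(pooled[i] for i in chosen) >= observed:
            n_extreme += 1
    return Fraction(n_extreme, n_assignments)


print(randomization_probability([9, 8, 7], [1, 2, 3]))
```

Here there are 20 ways of assigning three of the six invented observations to the first sample, and only the obtained assignment gives so large a sum, so the hypothesis of identical populations could be rejected at the 1/20 level.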

The method, as developed by Fisher, has been improved by
Wilcoxon who, by replacing original observation magnitudes by their ranks,
"standardized" the rejection region and permitted tabling of
probabilities. Wilcoxon's tests are among the most efficient and most
important distribution-free tests. The sample information used by Fisher
was the sample mean or mean difference; Wilcoxon used rank sums
or sums of algebraically signed ranks. Both constructed tests
sensitive to location.


1. Fisher's Method of Randomization: Matched Pairs


a. Rationale. Let n matched pairs of observations be taken,
one  member  of  each  pair  having  been  taken  under  treatment  A,  the 
other  under  treatment  B.  If  each  B  observation  is  subtracted  from 
its  paired  A  observation,  there  will  be  n  difference  scores  henceforth 
referred  to  as  the  obtained  sample.  If  the  A  and  B  treatments  have 
equal  effects,  in  all  respects  to  which  the  measurements  are  sensitive, 
then  the  members  of  any  given  matched  pair  of  observations  may  be 
regarded  as  having  been  drawn  from  the  same  population.  In  this 
case "treatment A" and "treatment B" are merely arbitrary labels
which are applied to two random observations from the same
population, and a specified one of the two observations is as likely to
acquire the label "A" as to be labeled "B". The difference score for
any given pair of observations is therefore as likely to be plus as to
be minus. If the A and B treatments produce effects whose
distributions are not identical but which are symmetrical about the same
point, a given difference score is also as likely to be plus as to be
minus because for each A_i - B_i difference score in the population
there is an equally likely "mirror-image" difference score of equal
magnitude but opposite sign. Therefore if either (a) the A population
and  the  B  population  are  identical,  or  (b)  if  the  A  and  B  populations 
are  symmetrical  about  the  same  point,  each  difference-score,  what¬ 
ever  its  magnitude,  will  be  as  likely  to  be  plus  as  to  be  minus.  Since 
plus and minus are equally likely algebraic signs for each of the n
difference magnitudes, each of the 2^n different possible arbitrary
assignments of algebraic signs to the obtained difference magnitudes
is equally likely for a sample containing these difference magnitudes
(provided no difference magnitudes are zero, for which an algebraic
sign is meaningless). That is to say, there are two ways of assigning
algebraic sign to the first difference magnitude; for each of these ways
there are two ways of assigning sign to the second magnitude, making
four distinguishable combinations; for each of these four combinations
the third magnitude can be treated in two ways, making eight
combinations, etc., so that for n difference scores there are 2^n
distinguishable patterns of algebraic sign which can be "superimposed" upon the
obtained set of difference magnitudes; and if the sampled populations
are either identical or symmetrical about the same point each of these
2^n sets of difference scores was exactly as likely to have been drawn
as a sample as was the set constituting the obtained sample.

Imagine now that for each of the sets of difference scores a
mean difference has been calculated by summing the n difference scores
and dividing by n. If the A and B populations are identical or are
symmetrical about the same point, each of these 2^n mean differences will be
equally probable. The N largest of these 2^n mean differences should
therefore contain the mean difference for the obtained sample in exactly
a proportion N/2^n of such experiments. On the other hand, if the A and
B populations are identical in form but differ in location, or if they are
both symmetrical but not symmetrical about the same point, the mean
difference for the obtained sample is more likely to lie among the
extreme N of the 2^n mean differences than the proportion N/2^n would imply.
And even if the two populations have nonidentical, asymmetrical forms,
one would generally expect large mean differences to be more likely
than small ones if the populations have different means.


b. Null Hypothesis. Each of the 2^n unique sets of difference
scores obtainable by arbitrarily assigning algebraic signs to the
obtained difference-score magnitudes is equally likely to have been
drawn as a sample. Either of two conditions is sufficient to insure
the validity of the null hypothesis: (a) the sampled populations are
identical, (b) the sampled populations are both symmetrical and are
symmetrical about a common point. By taking as the rejection region the
N sets with the N greatest mean differences, the method of
randomization tests the null hypothesis that populations are identical or
symmetrical about a common point against the alternative that the populations
have different means. It is merely "most sensitive" against this
alternative, however, since nonidentity of populations with equal means can
also cause rejection. Certain assumptions, therefore, are necessary
to eliminate such alternatives when they are not desired.
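
The enumeration of sign patterns lends itself to a short sketch. The function name is hypothetical; the test shown is one-tailed in the upward direction, the sum of the difference scores stands in for the mean difference (the two differ only by the constant divisor n), and zero differences are dropped, following the treatment of ties given for this test.

```python
from fractions import Fraction
from itertools import product


def matched_pairs_probability(differences):
    """Proportion N / 2^n of sign patterns whose summed difference is at
    least as large as that of the obtained sample."""
    magnitudes = [abs(d) for d in differences if d != 0]
    observed = sum(differences)
    n_extreme = 0
    for signs in product((1, -1), repeat=len(magnitudes)):
        if sum(s * m for s, m in zip(signs, magnitudes)) >= observed:
            n_extreme += 1
    return Fraction(n_extreme, 2 ** len(magnitudes))


print(matched_pairs_probability([3, -1, 4, 2]))
```

For the four invented difference scores there are 16 sign patterns; only the all-plus pattern and the obtained pattern give a sum of 8 or more, so the probability is 2/16.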


c. Assumptions. By taking as the probability fraction the
ratio of the number of ways certain events can occur, it is implied
that each way is equally probable when the null hypothesis is true and
unequally probable when it is false. However, they can be unequally
probable,  not  because  populations  violate  the  null  hypothesis,  but  rather 
because  of  bias  in  the  selection  of  samples  or  because  of  the  influence 
of  one  sample  unit  upon  another.  Therefore,  in  order  to  eliminate 
such  contingencies,  it  is  assumed  that  sampling  is  random  and  obser¬ 
vations  are  independent. 

By using 2^n as the denominator of the probability fraction, it
is implied that each difference-score has two possible values, one plus
and the other minus. This means that there must be no zero differences,
or the equivalent, but more general, assumption of continuously
distributed populations may be made.

If  the  populations  do  not  have  the  same  form  or  if  they  are  not 
symmetrical,  then  the  obtained  difference  scores  are  not  necessarily 
as  likely,  a  priori,  to  be  minus  as  to  be  plus  even  though  the  sampled 
populations have equal means. In order therefore to "eliminate" such
causes of unequally likely signs and confine the cause to unequal
population means, it is necessary to introduce the assumption that either
(a) the two sampled populations have identical forms, differing, if at
all, only in location, or (b) each sampled population has a symmetrical
distribution, the two distribution forms not necessarily being the same.

d. Treatment of Ties. If the number of zero differences, t,
is small relative to the total number of difference scores, discard them
and reduce n by t in all subsequent calculations, so that the denominator
of the probability fraction is 2^(n-t). It should be borne in mind that
discarding the zero differences artificially increases the power of the test.

e.  Efficiency.  No  figures  appear  to  be  available;  however, 
there is reason to believe efficiency should be high. See Wilcoxon test.

f. Application. As an example, suppose that each of seven
individuals has been subjected to each of two treatments, A and B, and
that there are no sequential or interaction effects between treatments.
The data are presented in the following table.


                              SCORES                      Sum   Mean

Treatment A         23   16   11   12    9    5    1       77     11
Treatment B          8    5    2    7    6    4    3       35      5
Difference: A-B     15   11    9    5    3    1   -2       42      6

There are 2^7 or 128 different ways of distributing plus and minus
signs among the seven difference scores. Three of these ways result
in a positive mean difference, and six result in an absolute mean
difference, as great as or greater than that actually obtained. They are as
follows:

          Difference Scores                 Sum     Mean

  15    11     9     5     3     1     2     46     6.57
  15    11     9     5     3    -1     2     44     6.29
  15    11     9     5     3     1    -2     42     6.00
 -15   -11    -9    -5    -3    -1    -2    -46    -6.57
 -15   -11    -9    -5    -3    +1    -2    -44    -6.29
 -15   -11    -9    -5    -3    -1    +2    -42    -6.00

As indicated, only a small number of the 2^n mean differences
need actually be calculated, specifically those equal to or more
extreme than that actually obtained or those constituting the rejection
region, whichever is less. Therefore, assuming populations identical
in form, the hypothesis that treatments have equal effects can
be rejected at the 6/128 or .047 level of significance in favor of the
alternative hypothesis that the mean effects of the two treatments
differ. Or, under a one-tailed test the hypothesis that treatment A
has the same effect or less mean effect than treatment B can be
rejected at the 3/128 or .023 level of significance in favor of the
hypothesis that treatment A has more mean effect than treatment B. If
it can be assumed that populations are either identical in form or
symmetrical, the term "effect" must be replaced by "mean effect"
in the expression of the null hypothesis.
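
The enumeration in this application can be checked mechanically. The following is a minimal sketch in Python, not part of the original report; the function name `randomization_test` is illustrative. It assumes there are no zero differences, and it compares sums rather than means, which is equivalent because every sign assignment involves the same n.

```python
from itertools import product

def randomization_test(diffs):
    """Exact sign-randomization test on paired difference scores.

    Returns (n_upper, n_two_sided, total): the number of sign assignments
    whose sum is at least the observed sum, the number whose absolute sum
    is at least the observed absolute sum, and the total number, 2**n.
    """
    magnitudes = [abs(d) for d in diffs]
    observed = sum(diffs)
    n_upper = n_two_sided = 0
    # Visit every way of attaching plus and minus signs to the magnitudes.
    for signs in product((1, -1), repeat=len(magnitudes)):
        total = sum(s * m for s, m in zip(signs, magnitudes))
        if total >= observed:
            n_upper += 1
        if abs(total) >= abs(observed):
            n_two_sided += 1
    return n_upper, n_two_sided, 2 ** len(magnitudes)

treatment_a = [23, 16, 11, 12, 9, 5, 1]
treatment_b = [8, 5, 2, 7, 6, 4, 3]
diffs = [a - b for a, b in zip(treatment_a, treatment_b)]
n_upper, n_two_sided, total = randomization_test(diffs)
print(n_upper, n_two_sided, total)  # 3 6 128
```

Dividing the first two counts by the third gives the one-tailed and two-tailed probabilities, 3/128 and 6/128. Full enumeration is practical only for small n, since the loop visits all 2^n assignments.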

g. Discussion. The magnitudes of the n difference scores
are, with rare exceptions, unequally likely. However, if the sampled
populations are identical or symmetrical about a common point, each
of the 2^n differently "signed" sets of difference scores is equally likely
because each set contains the same magnitudes and each magnitude is
as likely to be positive as to be negative. If the null hypothesis is
false, one of the two algebraic signs will be more probable than the
other. The more probable sign would be expected either to occur
more frequently than its opposite, or to be associated more frequently
with the larger than with the smaller magnitudes, or both. The
likelihood that a difference score had the more probable algebraic sign
would be expected to increase with the absolute magnitude of the
difference score. By taking as the rejection region those sets of
difference-scores (equally probable when the null hypothesis is true) which yield
the most extreme mean differences, one is quite properly permitting
the larger magnitudes to influence rejection more than the smaller ones.
Thus each algebraic sign may be considered to be "weighted" by the
difference-score magnitude to which it is attached. This weighting is
arbitrary, i.e., randomly determined, when the null hypothesis is
true and the distribution of the test statistic is such that each weight
is applied as frequently to positive as to negative signs. It is only
when the null hypothesis is false that the weighting takes on a
discriminating function, making the test especially sensitive to
differences in location.


The sample space for the test statistic consists of the 2^n
sets of difference scores obtainable by varying the signs attached
to the same set of n difference-score magnitudes. The test is
therefore a conditional test in the sense that the probability fraction N/2^n
gives the chance probability of drawing the obtained sample, or a
more  extreme  one,  from  that  artificially  limited  sample  space  rather 
than  from  the  larger  parent  population  of  difference  scores  from  which 
it  was  actually  drawn.  The  importance  of  this  fact  has  been  frequent¬ 
ly  overemphasized.  When  the  null  hypothesis  is  true  every  difference 
score  in  the  sampled  population  is  as  likely  to  be  plus  as  to  be  minus, 
not just those in the restricted sample space. Therefore the probability
of committing a Type I error is unaffected by restricting the sample
space, being exactly N/2^n whatever the particular set of difference
scores sampled. When the null hypothesis is false the relative
probability of possession of the two algebraic signs may differ greatly
from one population difference-score magnitude to another and not
necessarily  in  any  direct  relationship  to  the  absolute  size  of  the  mag¬ 
nitude.  Since  chance  determines  which  of  these  population  difference- 
scores  will  be  drawn  for  the  sample,  chance  plays  a  large  role  in 
determining  whether  or  not  a  false  hypothesis  will  be  rejected.  How¬ 
ever,  this  is  equally  true  of  nonconditional  tests.  It  is  more  or  less 
assumed,  for  both  conditional  and  nonconditional  tests,  that  the  sample 
is  fairly  representative  of  the  population.  To  the  extent  that  this  is 
untrue  both  types  of  test  are  likely  to  err;  to  the  extent  that  it  is  true 
the restriction of the sample space of Fisher's conditional test statistic
is not a serious shortcoming of the test.


In connection with criticisms of the conditional nature of
Fisher's test it has sometimes been fallaciously implied that the test
statistic has the same distribution under an alternative hypothesis as it
has under the null hypothesis. When the null hypothesis is false, just as
many of the 2^n possible values of the test statistic lie in the rejection region
as when the null hypothesis is true. However, when the null hypothesis
is true each of these 2^n values is equally probable, whereas when it
is false, those values occupying the rejection region are more probable
than the ones occupying the acceptance region, thus biasing the test
(properly so) in favor of rejection. Student's t test operates in much
the same way. The set of possible values of t is the same whether
the null hypothesis is true or false; it is only their probabilities which
differ. When the null hypothesis is true the possible values of t
constituting the rejection region have a cumulative probability of α, whereas
when it is false they have a cumulative probability greater than α. It
is incorrect, therefore, to imply, as has been done, that under the method
of randomization the test statistic has the same distribution under
alternative hypotheses as under the null hypothesis. This is no more true
of the method of randomization than of Student's t. Although Fisher's
and Student's tests operate in somewhat similar ways, however,
Fisher's test cannot be regarded as giving the "true" probability
which Student's test "approximates". This has sometimes been implied,
the difference in the two probabilities being attributed to violations of
the assumptions of Student's test or to other artifacts. The argument,
however, is fallacious. The two tests cannot be expected to yield
equal probabilities when applied to the same sample because (a) the
test statistics do not have the same distribution, (b) the tests do not
use the same rejection region.

Although  many  of  the  criticisms  of  the  method  of  randomization 
have  been  overstated,  it  does  have  a  number  of  shortcomings  which 
will  be  outlined  in  the  following  paragraphs. 

Two types of information are used in the test: algebraic sign
and magnitude. When the null hypothesis is true magnitudes are
randomly associated with equally likely algebraic signs. When it is
false magnitudes become nonrandomly associated with unequally
probable algebraic signs in a complex way: for some magnitudes one
algebraic sign becomes more probable than the other, and for other
magnitudes the reverse is probably the case. Presumably the larger the
magnitude the more likely it usually is to have the algebraic sign
indicating the true direction of difference; however, there is no justification
for assuming that this relationship is linear or even monotonic. Since
each sample consists of a different set of magnitudes and since the
magnitudes are, in effect, weights, each sample from the same population
is subjected to a different weight function. Since the weight function
varies from sample to sample and since the relationship of weight to
the probability of a given algebraic sign is unknown, probability levels
for samples from the same population are not strictly comparable.
Another way of stating this is that probability levels are not strictly
comparable because no two samples use the same rejection region.

Another, related, disadvantage of Fisher's method is that
the test is quite sensitive to isolated extreme difference scores.
Suppose, for example, that the obtained set of difference-scores were
+1, +2, +3, +4, +5, +6, +7, +8, +9, +50. There are 2^10 = 1024
possible ways of assigning signs to these magnitudes and the mean
difference for the obtained sample can be equaled or exceeded in only
one of them, so the obtained sample has a one-tailed probability of
1/1024 or less than .001. However if the algebraic sign of the 50
is changed to minus, the obtained mean difference becomes -.5 which
can be exceeded by any of the 512 assignments in which the 50 is plus.
The one-tailed probability therefore rises from less than .001 to
slightly more than .50 simply by changing the sign of one of ten
difference scores. This is in no way improper since, if the null
hypothesis is false, one would expect the difference in probability between
a +50 and a -50 to be much greater than the difference in probability
between a +1 and a -1. However it shows that the test gives great
weight to isolated extreme differences which frequently one wishes to
deemphasize because of the likelihood that they are spurious or
represent atypical performance (or response).
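
The effect of the single extreme score can be reproduced by direct enumeration. The sketch below is an illustration in Python rather than anything given in the original report, and the helper name `upper_tail_count` is hypothetical. It counts the sign assignments whose sum equals or exceeds the observed sum, first with the 50 positive and then with it negative.

```python
from itertools import product

def upper_tail_count(diffs):
    """Count sign assignments whose sum equals or exceeds the observed sum."""
    magnitudes = [abs(d) for d in diffs]
    observed = sum(diffs)
    return sum(
        1
        for signs in product((1, -1), repeat=len(magnitudes))
        if sum(s * m for s, m in zip(signs, magnitudes)) >= observed
    )

with_plus_50 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
with_minus_50 = [1, 2, 3, 4, 5, 6, 7, 8, 9, -50]
total = 2 ** len(with_plus_50)
print(upper_tail_count(with_plus_50), upper_tail_count(with_minus_50), total)
# 1 513 1024
```

The second count is 513 rather than 512 because the obtained assignment itself equals the observed sum; this is why the probability is slightly more than .50.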

A final disadvantage is that Fisher's method of randomization
requires that of the 2^n possible "ways" of calculating a mean difference
(using the same set of n difference magnitudes but varying their
algebraic signs) the experimenter must actually enumerate either the
number of ways constituting the rejection region or the number of ways
which result in a mean difference equaling or exceeding the one obtained,
whichever is less. If n is large, or if n is of moderate size and α is
large, the computations are likely to be so lengthy as to make the test
impractical. Since the exact forms of the sampled populations are
unknown the sample difference scores are of unpredictable magnitude
and it is impossible to construct probability tables in advance of
sampling.

h. Tables. None. Probabilities must be calculated for each
specific case.

i. Sources. 4, 7, 17, 26, 27, 34, 38, 39, 40, 48, 75.

See also 16, 28, 41, 42, 43, 67, 68 under 4, Fisher's Method of
Randomization: Unmatched Data.


2. The Wilcoxon Test: Matched Pairs


a.  Rationale.  Wilcoxon  has  modified  Fisher's  method  by 
replacing  the  obtained  difference-scores  with  the  ranks  of  their  abso¬ 
lute  magnitudes,  each  rank  being  given  the  algebraic  sign  of  the  dif¬ 
ference-score  which  it  replaces.  The  test  statistic  is  the  algebraic 
sum  of  the  signed  ranks  rather  than  the  average  signed  rank;  since 
the former is always n times the latter, the two have equivalent
distributions. Wilcoxon's modification has several advantages over the
original test. First, the test is not a conditional one since the sample
space for the test statistic is the same for every sample. Thus every
sample is made comparable with every other sample of the same size
in the sense that the set of numbers by which the signs of the differences
are  weighted  is  always  the  same:  the  sign  of  the  largest  difference 
magnitude  always  being  given  a  weight  of  n,  the  next  largest,  n-1, 
etc.  Second,  the  test  is  less  sensitive  to  extreme  difference-score 
magnitudes  since  the  most  extreme  magnitude  will  receive  a  rank  only 
one  greater  than  the  next-to-extreme  magnitude,  etc.  Finally,  by 
using  ranks,  the  probabilities  can  be  tabled,  since  for  any  given  n, 
instead  of  n  random  and  unpredictable  magnitudes,  the  magnitudes 
consist  always  of  the  integers  1  to  n. 

If each obtained difference-score magnitude is as likely to
be plus as to be minus, then so is its rank. The rationale for the
Wilcoxon test therefore parallels that for Fisher's method of
randomization. See 1, Fisher's Method of Randomization: Matched Pairs.

b. Null Hypothesis. Each of the 2^n unique sets of signed
ranks, obtainable by arbitrarily assigning algebraic signs to the ranks
of the difference-score magnitudes from the obtained sample, is
equally likely to have resulted from the random sampling process.
Either of two conditions is sufficient to insure the validity of the
null hypothesis: (a) the sampled populations are identical, (b) the
sampled populations are both symmetrical and are symmetrical about
a common point. For populations which are identical or symmetrical
about a common point, medians, as well as means, are equal. And
if the two populations are symmetrical, but symmetrical about
different points, or if they have identical forms, but different locations,
then medians, as well as means, differ. Thus, if all assumptions are
met, the Wilcoxon test is both a test for equality of medians and a test
for equality of means.

c. Assumptions. See 1, Fisher's Method of Randomization:
Matched Pairs, substituting "mean and median" for "mean".


d. Treatment of Ties. If there are an even number, x, of
zero difference scores, consider them to "occupy" the x lowest ranks,
give each of them the midrank, and assign half of them a plus sign, half
a minus sign in the obtained sample. Thus, if there are x zero
differences, each receives the rank (x+1)/2, and half of these identical ranks
are given a plus, half a minus. If there are an odd number of zero
differences, the odd one may be discarded and n reduced by one. Or,
all x+1 zero differences may be given the midrank (x+2)/2, and x/2 + 1
of these may be given the algebraic sign least conducive to rejection
of the null hypothesis, the remainder receiving the opposite sign.


If nonzero differences are tied in absolute magnitude, the
members of each tied group should be given the midrank of the group,
i.e., the average rank the members of the group would have if not
tied but differing infinitesimally in magnitude. The midrank of each
tied member is then given the algebraic sign of that member. An
error, which is usually small, is introduced by the occurrence of
ties and their treatment in this manner. For example, consider the
following set of signed ranks: 1, 2, 3, -4, 5, 6, 7. An equal or
smaller negative rank sum can be obtained in seven ways (no negative
ranks; a single negative rank of 1, 2, 3 or 4; or negative ranks 1 and 2,
or 1 and 3) and the significance level for the corresponding one-tailed
test is 7/2^7 or .055. However, if the first two ranks are tied, the set
becomes 1 1/2, 1 1/2, 3, -4, 5, 6, 7 and there are only six ways of
obtaining an equal or smaller rank sum (because 3 and 1 1/2 sum to 4 1/2 while
3 and 1 sum to 4, the value not to be exceeded). The significance
level is therefore 6/2^7 or .047.


The  above  treatment  minimizes  error  in  the  long  run.  To 
insure  that  zero  or  tied  differences  do  not  spuriously  cause  rejection 
in  a  specific  case,  arbitrarily  assign  the  tied-for  ranks  to  each  set 
of  tied  difference  scores  (including  zero),  then  give  each  of  the  resulting 
ranks  that  algebraic  sign  which  is  least  conducive  to  rejection  of  the 
null  hypothesis. 

It has sometimes been recommended that all zero differences
be discarded and n be reduced accordingly. The reason usually given
is that power is greatest if zero differences are treated this way.
However, the "increase" in power is quite deceptive since the increase
in the probability of rejecting a false null hypothesis is paralleled by an
increase in the probability of rejecting a true one. The latter increase
raises the actual value of α while its nominal value remains the same.
The increase in power is therefore a spurious one which cannot be
regarded as an advantage. See "Treatment of Ties" of the Sign test.

e. Efficiency. Asymptotic relative efficiency, compared with
Student's t-test when both tests are applied to populations meeting all
of the assumptions of the t-test, is 3/π or .955. The corresponding
efficiency for finite samples increases with decreasing sample size,
becoming as high as .995 in certain cases. See 3, Test for Location of
the Median.

f.  Application.  Let  the  following  table  represent  data  collected 
in  the  application  of  treatments  to  pairs  of  rats  from  a  common  popula¬ 
tion,  the  pairing  having  been  done  on  the  basis  of  weight.  The  null  hy¬ 
pothesis  is  that  for  each  weight  category  the  two  treatments  have  effects 
which  are  either  identically  distributed  or  are  symmetrically  distributed 
about  the  same  median.  The  alternative  hypothesis  is  that  in  one  or 
more  weight  categories  the  two  treatment  effects  do  not  have  common 
medians  and  means. 


Treatment A           42      37      63      27      46      49      54      39      46     101
Treatment B           42      37      59      34      38      40      43      25      32      33
A-B Difference         0       0       4      -7       8       9      11      14      14      68
Magnitude ranks    1 1/2   1 1/2       3       4       5       6       7   8 1/2   8 1/2      10
Signed ranks       1 1/2  -1 1/2       3      -4       5       6       7   8 1/2   8 1/2      10

The sum of the negatively signed ranks is -5 1/2. A negative sum
that small or smaller can be obtained in the following ways:

   1 1/2   1 1/2       3       4      -5       6       7   8 1/2   8 1/2      10
   1 1/2   1 1/2       3      -4       5       6       7   8 1/2   8 1/2      10
   1 1/2  -1 1/2       3      -4       5       6       7   8 1/2   8 1/2      10
  -1 1/2   1 1/2       3      -4       5       6       7   8 1/2   8 1/2      10
   1 1/2   1 1/2      -3       4       5       6       7   8 1/2   8 1/2      10
   1 1/2  -1 1/2      -3       4       5       6       7   8 1/2   8 1/2      10
  -1 1/2   1 1/2      -3       4       5       6       7   8 1/2   8 1/2      10
   1 1/2  -1 1/2       3       4       5       6       7   8 1/2   8 1/2      10
  -1 1/2  -1 1/2       3       4       5       6       7   8 1/2   8 1/2      10
  -1 1/2   1 1/2       3       4       5       6       7   8 1/2   8 1/2      10
   1 1/2   1 1/2       3       4       5       6       7   8 1/2   8 1/2      10


Thus 11 of the 1024 possible assignments of algebraic sign to the ranks
shown above lead to a negative sum as small or smaller than that
derived from the obtained sample. The significance level for a
one-tailed test of the hypothesis that treatment A produces the same or
less "location" effect than treatment B is therefore 11/1024 or slightly
greater than .01. For a two-tailed test, there would be 22
assignments giving a sum with absolute value as small as that obtained, and
the significance level would be 22/1024 or approximately .02. In
practice, significance levels would have been obtained from one of the
many tables available and the above enumerations would have been
unnecessary. One need only find the sum of the positively signed ranks
and the sum of the negatively signed ranks for the obtained sample.
The smaller of these two sums in absolute magnitude is referred to
prepared tables.
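
When no table is at hand, the count of 11 can be obtained by enumeration. What follows is a minimal sketch in Python, not drawn from the original report; the function names `midranks` and `negative_sum_count` are illustrative. It assumes an even number of zero differences, half of which are signed plus and half minus as described under Treatment of Ties, and it uses exact fractions so that the half-ranks compare without rounding error.

```python
from fractions import Fraction
from itertools import product

def midranks(values):
    """Rank values from smallest to largest, giving tied values their midrank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [None] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Sorted positions i..j hold ranks i+1..j+1; their average is the midrank.
        mid = Fraction(i + 1 + j + 1, 2)
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

def negative_sum_count(ranks, observed_negative_sum):
    """Count sign assignments whose negative ranks total no more than observed."""
    count = 0
    for signs in product((1, -1), repeat=len(ranks)):
        negative = sum(r for s, r in zip(signs, ranks) if s < 0)
        if negative <= observed_negative_sum:
            count += 1
    return count

treatment_a = [42, 37, 63, 27, 46, 49, 54, 39, 46, 101]
treatment_b = [42, 37, 59, 34, 38, 40, 43, 25, 32, 33]
diffs = [a - b for a, b in zip(treatment_a, treatment_b)]
ranks = midranks([abs(d) for d in diffs])

# Sign of each rank; zero differences alternate plus, minus (even count assumed).
signs = [1 if d > 0 else -1 for d in diffs]
zero_positions = [i for i, d in enumerate(diffs) if d == 0]
for k, i in enumerate(zero_positions):
    signs[i] = 1 if k % 2 == 0 else -1

observed_negative = sum(r for s, r in zip(signs, ranks) if s < 0)
ways = negative_sum_count(ranks, observed_negative)
print(observed_negative, ways, 2 ** len(ranks))  # 11/2 11 1024
```

The printed values correspond to the negative rank sum of 5 1/2, the 11 assignments listed, and the 1024 possible assignments.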

g. Discussion. In analogy with the treatment of Fisher's
test, when the Wilcoxon test is used as a test for location it has been
assumed that the two sampled populations have either the same form
or forms each of which is symmetrical. This means that "treatment",
if it produces any effect at all, merely causes a translation or slippage
of one distribution relative to the other along the x-axis. Such
uncomplicated treatment effects are, in fact, seldom encountered since
factors affecting the location of a distribution tend also to affect its
dispersion and form. It is reasonable enough to consider that the
populations have either identical or symmetrical forms if the null hypothesis
is true because a true null hypothesis implies that one of these
conditions exists (and implies further that they have identical location
parameters). A false null hypothesis does not imply it. Since the
assumption is an unrealistic one, it is of interest to examine the likelihood
that its failure to be met will cause false acceptance of the
alternative hypothesis that the populations differ in location.

If  the  assumption  is  dropped,  then,  when  the  null  hypothesis  is 
false,  the  true  situation  may  be  described  by  one  of  a  number  of  alter¬ 
native  hypotheses:  (a)  the  two  populations  differ  in  all  location  para¬ 
meters  and  have  symmetrical  or  identical  forms,  (b)  the  two  popula¬ 
tions  differ  in  all  location  parameters  and  do  not  have  symmetrical 
or  identical  forms,  (c)  the  two  populations  differ  in  certain  location 
parameters  but  not  others  and  do  not  have  symmetrical  or  identical 
forms,  (d)  the  two  populations  have  identical  location  parameters  and 
do  not  have  symmetrical  or  identical  forms.  If  either  (a)  or  (b)  is 
true  the  experimenter  does  not  err  in  accepting  the  alternative  hypo¬ 
thesis  that  the  two  populations  differ  in  location.  If  (d)  were  true  it 
would  mean  that  two  populations  in  each  of  which  mean  and  median 
differed  (because  the  populations  are  not  symmetrical)  had  equal  means 
and  equal  medians  but  different,  asymmetrical  forms.  This  requires 
the  unlikely  coincidence  that  two  curves  with  different  contours  either 
cross  or  touch  at  each  of  two  specified  points.  The  probability  for 
(d)  is  therefore  obviously  very  small.  For  (c)  however  it  is  required 
only  that  different  curves,  at  least  one  of  which  is  asymmetrical,  cross 
or  touch  at  one  of  certain  specified  points.  Thus  the  two  populations 
may  have  equal  means  but  unequal  medians  or  the  reverse.  Case  (c), 
therefore, is not at all improbable, and it raises the question, "To
which location parameter is the test most sensitive?"

Fisher's test took the mean difference as its test statistic and,
in effect, took extreme mean differences as its rejection region. The
mean difference is the same as the difference between sample means.
There is therefore a direct relationship between the test statistic and
the difference between population means. Fisher's test, therefore,
would be expected to be most sensitive to differences between means.

The situation is not nearly so clear cut in the case of the
Wilcoxon test. Here the test statistic is neither the difference between
means nor the difference between medians, nor does its rejection
region consist of such measures. In Fisher's test the average
difference score is also the difference between sample means, but in
Wilcoxon's test the average signed rank, which is, in effect, the test
statistic, does not correspond to any statistic indicating difference in
a standard location parameter.

To pursue the question further, if no assumptions other than
continuity, randomness and independence were made, Fisher's test
would still appear to be a reasonable test for differences in means.
The Sign test, which ignores difference-score magnitudes and uses
only their direction, i.e., algebraic sign, is obviously the appropriate
analogous test for difference in population medians. But Fisher's
test, the Wilcoxon test and the Sign test all use the signs of difference
scores, differing primarily in the weight which the signs are given
prior to summing. For the Sign test the weight is always 1, for
the Wilcoxon test it is the rank of the difference-score's absolute
magnitude, and for Fisher's test it is the absolute magnitude itself.
The Wilcoxon test therefore is intermediate between a test sensitive
only to differences in medians and a test sensitive primarily to
differences in means. Under the limited assumptions listed above,
therefore, the Wilcoxon test should be considered sensitive to both
differences in medians and differences in means. Without the
assumption of symmetrical or identical forms, therefore, it would
be futile to attempt to specify which location parameters differ when
the null hypothesis is rejected.
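
The common structure of the three tests, a sum of signs under different weightings, can be made concrete. The sketch below is illustrative Python and not part of the original report; the function name `weighted_sign_statistic` is hypothetical. The difference scores are those of the earlier example for Fisher's method, which contain no zeros and no tied magnitudes, so that simple ranks suffice.

```python
def weighted_sign_statistic(diffs, weights):
    """Sum of the signs of the difference scores, each multiplied by its weight."""
    return sum((1 if d > 0 else -1) * w for d, w in zip(diffs, weights))

diffs = [15, 11, 9, 5, 3, 1, -2]
magnitudes = [abs(d) for d in diffs]

# Rank of each magnitude; this lookup is valid only when no magnitudes are tied.
rank_of = {m: r for r, m in enumerate(sorted(magnitudes), start=1)}

sign_weights = [1] * len(diffs)                     # Sign test: every weight is 1
wilcoxon_weights = [rank_of[m] for m in magnitudes]  # Wilcoxon: rank of magnitude
fisher_weights = magnitudes                          # Fisher: the magnitude itself

sign_stat = weighted_sign_statistic(diffs, sign_weights)
wilcoxon_stat = weighted_sign_statistic(diffs, wilcoxon_weights)
fisher_stat = weighted_sign_statistic(diffs, fisher_weights)
print(sign_stat, wilcoxon_stat, fisher_stat)  # 5 24 42
```

The three values are the excess of plus over minus signs, the algebraic sum of the signed ranks, and the sum of the difference scores. Only the weights change from one test to the next.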

Both Fisher's and Wilcoxon's tests test the null hypothesis
that for every matched pair the observations come from identical
populations or populations symmetrically distributed about a common
point. It is not assumed that the members of every matched pair
are sampled from the same two populations. There may, in fact,
be as many pairs of populations as there are difference scores.
However, if each pair of units be regarded as equally "important", i.e.,
to be given equal, a priori weight in determining whether to reject
or not, another assumption is required. Under the conditions stated,
in  order  to  obtain  optimal  power  it  must  be  assumed  that  each  differ¬ 
ence  score  is  as  likely  to  have  been  obtained  from  one  matched  pair 
of  units  as  from  another.  This,  in  turn,  means  that  whatever  the 
variation  among  the  various  A-populations  or  among  the  n  different 
B-populations,  the  n  difference  scores  came  from  identical  differ¬ 
ence-score  populations. 

This assumption is analogous to that of homoscedasticity.
Without the assumption, if for every matched pair the A and B
populations are identical, the pairs whose AB populations have greatest
variance are the pairs most likely to have difference scores of large
magnitude. These particular pairs will therefore exert greater
influence upon the outcome of the test than will those whose AB populations
have relatively small variance. When the null hypothesis is false,
large difference-score magnitudes resulting from real treatment
effects may tend to be cancelled out by large difference-score
magnitudes resulting from large population variances and having, by
chance, the opposite sign. The power of the test is therefore affected
adversely when the assumption is not met.

It  has  sometimes  been  claimed  that  so  long  as  the  members 
of  each  pair  were  obtained  under  matched  conditions,  the  basis  for 
matching  may  vary  from  pair  to  pair.  It  is  clear  that  such  a  procedure 
is  quite  likely  to  result  in  unequal  population  variances  for  the  various 
A  populations  as  well  as  for  the  B  populations  and  thus,  probably,  for 
the  population  of  AB  differences.  Therefore  the  power  of  the  test  is 
likely  to  be  altered  in  such  a  way  that  the  matching  criteria  will  in¬ 
fluence  the  outcome  of  the  test  and  the  influence  of  certain  of  the  cri¬ 
teria  will  be  greater  than  that  of  others.  Furthermore,  a  certain 
ambiguity  arises  when  the  null  hypothesis  is  rejected  because  it  is 
not  clear  what  alternative  hypothesis  is  to  be  embraced.  A  sample 
of variously matched scores can only be regarded as representing
a multivariate, or at least "multiconditional," population. Therefore,
it  is  this  population  to  which  statistical  inference  must  be  extended, 
and  conclusions  must  lack  a  certain  specificity. 

To  summarize,  it  is  true  that  the  mathematical  basis  of  the 
Wilcoxon  test  does  not  require  the  assumption  that  all  paired  scores 
were  matched  on  the  basis  of  the  same  criterion.  However,  unless 
such a procedure is followed, the test is likely to be biased in the
sense that certain pairs will yield difference scores with greater
variance, and therefore be given greater influence over the test's outcome,
than  others,  and  it  is  unlikely  that  the  experimenter  will  know  which 
pairs  are  so  favored.  This  unknown  and  unequal  influence  makes  inter¬ 
pretation  of  the  test  extremely  unclear  whether  the  null  hypothesis  is 
rejected  or  not.  And  if  the  null  hypothesis  is  rejected  it  is  not  clear 
what  alternative  hypothesis  to  accept  because  the  cause  of  rejection 
is  uncertain. 

h.  Tables.  Tables  can  be  found  in  53,  70,  72,  73,  and  in  some 
of  the  sources  listed  in  the  introduction.  For  cases  not  covered  by 
existing  tables,  exact  probabilities  may  be  calculated  by  the  method 
of  complete  enumeration,  or  approximate  probabilities  may  be  obtained 
from normal tables by treating the rank sum as a normal deviate. Let
T be the rank sum for ranks of one sign. Then, if the null hypothesis
is true, T comes from a population of rank sums whose mean is
$\frac{n(n+1)}{4}$ and whose variance is $\frac{n(n+1)(2n+1)}{24}$.
As n approaches infinity, the distribution of T approaches the normal
distribution. Therefore the approximate probability level for T can be
obtained by referring the critical ratio

$$\frac{T - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}}$$

to normal probability tables. The approximation is reasonably good,
when n is large, except at the extreme tails of the normal
distribution. Therefore extreme levels of significance, such as
the .001, should not be adopted when the normal approximation is used.
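
As a rough illustration of the normal approximation just described, the
following Python sketch computes the critical ratio for a rank sum T of
one sign, taking the null mean as n(n+1)/4 and the null variance as
n(n+1)(2n+1)/24. The function name and the example values (T = 52,
n = 20) are hypothetical and are not taken from any of the cited sources.

```python
import math

def signed_rank_normal_approx(t, n):
    """Return (z, lower_tail_p) for a rank sum t of one sign.

    n is the number of nonzero difference scores. Illustrative only:
    the approximation is intended for large n and moderate tail areas.
    """
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24
    z = (t - mean) / math.sqrt(variance)
    # P(Z <= z) for a unit normal deviate, via the error function.
    lower_tail_p = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, lower_tail_p

# Hypothetical example: 20 difference scores, smaller rank sum of 52.
z, p = signed_rank_normal_approx(t=52, n=20)
print(round(z, 3), round(p, 4))
```

The lower-tail probability is appropriate when T is the smaller of the
two rank sums; for a two-tailed test it would be doubled.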


i.  Sources.  53,  70,  71,  72,  73,  74.  See  also  5,  The  Wilcoxon 
Test:  Unmatched  Data. 


3.  Test  for  Location  of  the  Median 

a.  Rationale.  Let  n  observations  be  taken  from  a  continuous, 
symmetrically  distributed  population  and  let  the  population  median  be 
subtracted  from  each  observation.  Then  the  difference-scores  con¬ 
stitute  a  sample  of  size  n  from  a  continuously  distributed  population 
symmetrical  about  a  median  of  zero.  Therefore  each  of  the  n  dif¬ 
ference-scores  was  as  likely,  before  sampling,  to  be  positive  as  to 
be  negative.  And  since  the  populations  are  continuous,  zero  differ¬ 
ences  are  not  to  be  expected.  Now,  rank  the  difference  scores  in 
order  of  absolute  magnitude  and  give  each  such  rank  the  algebraic 
sign  of  the  difference-score  whose  magnitude  it  represents.  If  the 
true population median was subtracted from each of the n observations,
the rank sum for ranks of one algebraic sign will have the
same distribution as that tabled for the Wilcoxon matched-pairs test.

In fact, this test may be regarded as a Wilcoxon test in which the
A-population is symmetrical and the B-population is a single value, the
median of the A-population.

Actually  the  n  observations  need  not  be  taken  from  the  same 
population.  Each  observation  may  be  drawn  from  a  different  popula¬ 
tion  so  long  as  every  sampled  population  is  continuous  and  symmetrical. 
b. Null Hypothesis. Each of the $2^n$ unique sets of signed ranks,
obtainable  by  arbitrarily  assigning  algebraic  signs  to  the  ranks  of  the 
difference-score  magnitudes,  is  equally  likely  to  have  resulted  from 
the  random  sampling  process.  This  will  be  the  case  if  all  assumptions 
are  met  and  if  all  sampled  populations  have  the  same  median. 

c. Assumptions. Random and independent observations and
no zero differences, or preferably continuously distributed populations.
(For reasons see 1, Fisher's Method of Randomization: Matched Pairs.)
In addition it is assumed that every sampled population is
symmetrically distributed. Therefore, if all assumptions are met the null
hypothesis  can  be  false,  i.  e.  ,  plus  and  minus  can  be  unequally  likely 
signs  for  a  difference  score,  only  because  the  subtracted,  hypothe¬ 
sized  median  is  not  the  true  population  median. 

d. Treatment of Ties. See 2, The Wilcoxon Test: Matched Pairs.


e. Efficiency. Asymptotic efficiency relative to Student's
t when both tests are applied to normally distributed populations is
$3/\pi$ or .955 (Pitman quoted in 53). Small-sample efficiency for the
same situation appears to vary between .875 and .995 for n < 15 (53, 64,
65, 66).


f. Application. Subtract the single hypothesized median
from  each  of  the  n  obtained  observations.  Apply  the  Wilcoxon  matched- 
pairs  test  to  the  difference  scores.  If  the  null  hypothesis  is  rejected, 
conclude  that  the  hypothesized  median  is  not  the  true  median  in  all  of 
the  populations  sampled. 

Alternatively,  apply  the  Walsh  test  (see  Discussion)  to  the 
difference  scores,  drawing  the  same  conclusion  if  the  null  hypothesis 
is  rejected. 
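
The application above can be sketched in Python for small n by
enumerating all $2^n$ equally likely assignments of signs to the ranks.
This is a minimal sketch that assumes no zero differences and no tied
magnitudes; the function name and the eight observations are
hypothetical, chosen only so that the arithmetic is easy to check by hand.

```python
from itertools import product

def median_test_exact(observations, hypothesized_median):
    """Exact two-tailed signed-rank count for a hypothesized median.

    Returns (t_obs, count, total): the smaller rank sum, the number of
    sign assignments giving a smaller rank sum that small or smaller,
    and the total number 2**n of sign assignments.
    Assumes no zero differences and no tied magnitudes.
    """
    diffs = [x - hypothesized_median for x in observations]
    n = len(diffs)
    # Rank the difference scores by absolute magnitude (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    t_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    rank_total = n * (n + 1) // 2
    t_obs = min(t_pos, rank_total - t_pos)
    # Enumerate every way of attaching plus signs to the ranks 1..n.
    count = 0
    for signs in product((0, 1), repeat=n):
        t = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(t, rank_total - t) <= t_obs:
            count += 1
    return t_obs, count, 2 ** n

# Hypothetical data; hypothesized median of 10.
t_obs, count, total = median_test_exact([31, 4, 58, 23, 69, 42, 77, 36], 10)
print(t_obs, count, total)  # 1 4 256
```

Here only the smallest difference is negative, so the smaller rank sum
is 1 and the two-tailed probability is 4/256.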


g.  Discussion.  Walsh  (64,  65,  66)  has  outlined  a  test  which 
Tukey  (53)  has  shown  to  be  equivalent  to  the  above  application  of  the 
Wilcoxon  test.  Walsh  assumes  populations  each  of  which  is  continuous 
and  symmetrical  and  tests  the  hypothesis  that  all  populations  have  a 
common  specified  median.  An  observation  is  drawn  from  each  popu¬ 
lation  and  the  n  observations  are  then  ranked  in  order  of  algebraic 
magnitude. The null hypothesis is rejected if certain order statistics
(depending on the tail or tails selected for the rejection region) exceed
or are exceeded by the hypothesized median. The order statistics used
are  the  averages  of  two  observations  of  specified  rank.  The  efficiency 
of  the  test  is  high,  being  the  same  as  that  of  the  Wilcoxon  test,  and  tables 
(64,  65,  66)  are  available  for  small  values  of  n.  Tukey  has  pointed 
out  that  the  Wilcoxon  test  is  easier  to  apply  when  testing  the  hypothesis 
of  a  common  median  of  specified  value,  while  the  Walsh  test  is  easier 
for  setting  confidence  limits  for  the  median.  This  follows  from  the 
manner  in  which  the  Walsh  test  is  applied:  the  null  hypothesis  is  re¬ 
jected  if  the  hypothesized  median  falls  above  or  below  a  difference 
score  of  a  certain  rank  or  the  average  of  two  difference  scores  whose 
ranks  are  specified.  The  Wilcoxon  test,  on  the  other  hand,  estab¬ 
lishes  confidence  limits  by  a  trial  and  error  method  (74).  See  53 
for  exact  Walsh  method. 

h. Tables. Tables listed under 2, The Wilcoxon Test: Matched Pairs,
are appropriate. Also 64, 65, and 66 give tables
specifically designed for this application and particularly appropriate
for setting confidence limits.

i. Sources. 53, 64, 65, 66.


4.  Fisher's  Method  of  Randomization:  Unmatched  Data 

a. Rationale. If two samples, of sizes m and n, are random
samples from the same population, they may be regarded as a single
sample of size m+n which has been drawn from the parent population
and then divided on some random, i.e., chance, basis into two
subsamples of sizes m and n. If the observations are not matched or
paired in any way and if no observations have the same value, there
are $\binom{m+n}{n}$ different ways such a "split" could be obtained,
and each of these ways is equally likely.

Now suppose that for each "way" some statistic, say the mean,
is calculated for each of the two subsamples and the difference
$\bar{X}_A - \bar{X}_B$ obtained, the subscripts A and B being arbitrary
labels to identify the two subsamples. If N of these
$\bar{X}_A - \bar{X}_B$ differences equal or exceed the
$\bar{X}_A - \bar{X}_B$ difference for the actually obtained samples,
then the chance probability of the actually obtained
$\bar{X}_A - \bar{X}_B$ difference or one more extreme among the
differences calculated for the $\binom{m+n}{n}$ "splits" is
$N/\binom{m+n}{n}$.


If  the  two  original  samples  were  actually  obtained  under 
two  different  treatments,  then  if  the  treatments  have  equal  effects, 
the  samples  are,  in  effect,  samples  from  the  same  population.  Thus 
the hypothesis of identical treatment effects can be tested at the
$\alpha$ level of significance by rejecting the hypothesis if
$N/\binom{m+n}{n} \le \alpha$.


b. Null Hypothesis. Each of the $\binom{m+n}{n}$ different pairs of
"samples" obtainable by dividing the total of m+n observations into
two sets, one containing m observations, the other n observations,
is equally likely to have been obtained in the experiment. A
sufficient condition for the validity of the null hypothesis is that the
two  sampled  populations  are  identically  distributed.  This  will  be 
the  case  if  treatments  do  not  differ  in  their  measured  effects  on 
individuals  and  if  individuals  are  assigned  randomly  to  treatments. 

By  taking  as  the  rejection  region  the  N  pairs  of  sets  with  the  N 
greatest  mean  differences,  the  method  of  randomization  tests  the 
null hypothesis that populations are identical and is "most sensitive"
to  the  alternative  hypothesis  that  the  populations  have  different  means. 


c. Assumptions. Bias in the sampling process or possible
influence of one sampled observation upon another may cause
some of the $\binom{m+n}{n}$ pairs of rearranged samples to be more likely
than others to have been the pair actually drawn. And this may be
the  case  even  though  all  observations  in  both  samples  are  drawn  from 
the  same  population.  Therefore,  in  order  to  confine  the  cause  of 
unequal  probability  to  failure  of  the  null  hypothesis,  it  is  necessary 
to  assume  that  sampling  is  random  and  observations  are  independent. 


If any of the m+n observations have the same value there will
be fewer than $\binom{m+n}{n}$ distinguishable rearrangements of
observations into samples of sizes m and n. Thus the sample space for
the test statistic will be smaller than that represented by the
denominator of the probability fraction. In order to "eliminate" such an
eventuality, it is assumed that there are no tied observations. This
assumption is sometimes expressed in its mathematically equivalent form:
populations are continuously distributed.

If the two sampled populations do not have identical forms,
the $\binom{m+n}{n}$ pairs of hypothetical samples may be, and probably
are, unequally probable even though the two populations have the same
mean. For example, if the two populations are normally distributed with
the same mean but different variances, the "splits" which give the more
extreme observations to the "sample" from the population with the greater
variance are more probable than are the "splits" which do the opposite.
Furthermore, if the two populations have both unequal means and
different forms, the inequality of means may bias the probability in one
direction and the dissimilarity of form may bias it in the opposite
direction. Thus the two causes of unequal probability may tend to
balance one another. It is extremely unlikely that this balance would
be complete, leaving each of the $\binom{m+n}{n}$ pairs of samples
equally probable. However, the power of the test would be adversely
affected. In order, therefore, to confine the cause of failure of the
null hypothesis to inequality of population means, the alternative
hypothesis, it is assumed that, whatever their location, the two sampled
populations have identical forms.


Since  the  last  named  assumption  is  a  fairly  unrealistic  one, 
the  experimenter  may  prefer  to  substitute  the  more  reasonable  assump¬ 
tion  that  if  population  means  are  equal  their  forms  are  identical.  Thus 
any  dissimilarity  of  form  must  be  accompanied  by  an  inequality  of  means, 
and  the  null  hypothesis  can  be  false  only  when  means  differ.  When  the 
null  hypothesis  is  false,  then  the  alternative  hypothesis  of  unequal  means 
must  be  true.  However,  the  power  of  the  test  to  detect  the  validity  of 
the  alternative  hypothesis  may  be  much  smaller  than  would  be  the  case 
if  identical  forms  could  be  legitimately  assumed. 




d. Treatment of Ties. If a small proportion of the observations
are tied it may be reasonable to suppose that the ties are
attributable to the discreteness of the measuring instrument rather than
lack of continuity in the distribution of the thing measured. Therefore,
treat each tied observation as though it were unique in determining N,
and use $\binom{m+n}{n}$, unaltered, as the denominator of the
probability fraction.


e. Efficiency. High efficiency for this test is suggested by
the  high  efficiency  of  the  Wilcoxon  test  which  is  a  modification  of  it. 

See  5,  The  Wilcoxon  Test:  Unmatched  Data. 

f. Application. To modify an example given by Fisher, suppose
that the height, in centimeters, has been measured for 8 Englishmen
and 7 Frenchmen, and that it is desired to test the hypothesis that
Englishmen and Frenchmen have the same average height.

                                                        Mean
Englishmen:  188, 182, 178, 177, 176, 174, 173, 170    177.25
Frenchmen:   172, 171, 169, 165, 164, 162, 160         166.14
             $\bar{X}_E - \bar{X}_F = 11.11$


There are $\binom{15}{7}$ or 6435 different ways of reassigning the
height measurements so as to give eight of them to Englishmen, seven to
Frenchmen. In only four of them will the Englishmen's mean exceed the
Frenchmen's mean by a value as great as that obtained in the actual
samples:


                                                        Mean
Englishmen:  188, 182, 178, 177, 176, 174, 173, 172    177.50
Frenchmen:   171, 170, 169, 165, 164, 162, 160         165.86
             $\bar{X}_E - \bar{X}_F = 11.64$

Englishmen:  188, 182, 178, 177, 176, 174, 173, 171    177.375
Frenchmen:   172, 170, 169, 165, 164, 162, 160         166.00
             $\bar{X}_E - \bar{X}_F = 11.375$

Englishmen:  188, 182, 178, 177, 176, 174, 173, 170    177.25
Frenchmen:   172, 171, 169, 165, 164, 162, 160         166.14
             $\bar{X}_E - \bar{X}_F = 11.11$

Englishmen:  188, 182, 178, 177, 176, 174, 172, 171    177.25
Frenchmen:   173, 170, 169, 165, 164, 162, 160         166.14
             $\bar{X}_E - \bar{X}_F = 11.11$


Thus the significance level for a one-tailed test of the hypothesis that
the average Frenchman is as tall as or taller than the average Englishman
is 4/6435 and the hypothesis could be rejected at an extreme level of
significance. Since the hypothesis is that Englishmen and Frenchmen
have equal average heights, there are, in addition to the four ways in
which so great a mean difference could be obtained in favor of the
Englishmen, the following four ways in which so extreme a mean
difference can be found in favor of the Frenchmen.


                                                        Mean
Englishmen:  172, 171, 170, 169, 165, 164, 162, 160    166.625
Frenchmen:   188, 182, 178, 177, 176, 174, 173         178.286
             $\bar{X}_E - \bar{X}_F = -11.661$

Englishmen:  173, 171, 170, 169, 165, 164, 162, 160    166.750
Frenchmen:   188, 182, 178, 177, 176, 174, 172         178.143
             $\bar{X}_E - \bar{X}_F = -11.393$

Englishmen:  173, 172, 170, 169, 165, 164, 162, 160    166.875
Frenchmen:   188, 182, 178, 177, 176, 174, 171         178.000
             $\bar{X}_E - \bar{X}_F = -11.125$

Englishmen:  174, 171, 170, 169, 165, 164, 162, 160    166.875
Frenchmen:   188, 182, 178, 177, 176, 173, 172         178.000
             $\bar{X}_E - \bar{X}_F = -11.125$


Thus  for  a  two-tailed  test  of  the  null  hypothesis  that  there  is  no  dif¬ 
ference  between  the  average  heights  of  Englishmen  and  Frenchmen, 
the significance level is 8/6435.

It  happened  in  the  above  example  that  the  number  of  mean 
differences  as  great  as  the  absolute  value  of  the  obtained  mean  dif¬ 
ference  is  the  same  for  positive  as  for  negative  mean  differences. 

This  is  certain  to  be  the  case  only  when  m  =  n.  When  the  two  samples 
are  of  unequal  size,  the  significance  level  for  a  two-tailed  test  is  not 
necessarily  twice  that  for  a  one-tailed  test,  because  symmetry  no 
longer  obtains. 
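
The counts in the height example can be checked by complete
enumeration. The Python sketch below runs through all 6435 splits of
the fifteen heights; the function and variable names are illustrative
only. To avoid rounding error it compares the integer quantity
n(sum of A) - m(sum of B), which is simply mn times the difference
between the two sample means.

```python
from itertools import combinations
from math import comb

english = [188, 182, 178, 177, 176, 174, 173, 170]
french = [172, 171, 169, 165, 164, 162, 160]

def randomization_counts(a, b):
    """Count splits at least as extreme as the observed one in each tail.

    Returns (upper, lower, splits). The statistic n*sum(A) - m*sum(B)
    equals m*n times the mean difference, so comparisons are exact.
    """
    pooled = a + b
    m, n = len(a), len(b)
    total = sum(pooled)
    observed = abs(n * sum(a) - m * sum(b))
    upper = lower = 0
    # Choose which m of the m+n observations are labelled "A".
    for idx in combinations(range(m + n), m):
        s = sum(pooled[i] for i in idx)
        stat = n * s - m * (total - s)
        if stat >= observed:
            upper += 1
        if stat <= -observed:
            lower += 1
    return upper, lower, comb(m + n, m)

upper, lower, splits = randomization_counts(english, french)
print(upper, lower, splits)  # 4 4 6435
```

The one-tailed level is then upper/splits = 4/6435 and the two-tailed
level is (upper + lower)/splits = 8/6435, in agreement with the
enumeration given in the text.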

g. Discussion. Many of the points requiring discussion are
highly analogous to those discussed under 1, Fisher's Method of
Randomization: Matched Pairs; therefore, the arguments will not be
repeated here.


Obviously the Method of Randomization is not restricted to
testing for differences between means. The significance of a variety
of "difference" statistics calculated from two samples can be tested
by calculating the statistic for each of the $\binom{m+n}{n}$ splits and
taking as the rejection region those N splits for which the calculated
statistic has the N most extreme values, the significance level,
$\alpha$, being $N/\binom{m+n}{n}$. The "most extreme" values are of
course those most suggestive that the alternative hypothesis, rather
than the null hypothesis, is true. The alternative hypothesis states, in
effect, that the population statistic corresponding to the statistic
calculated from the obtained samples is not zero. However, unless the
sample statistic can be expected to "represent" well its population
counterpart, the power of the test may be very small. For example, the
method could not be expected to provide a powerful test for a difference
in population ranges.


Pitman  (41,  42,  43)  has  elaborated  upon  the  method  of  testing 
for  a  difference  between  population  means  and  has  applied  the  Method 
of  Randomization  to  testing  the  significance  of  a  correlation  coefficient 
(See  next  chapter)  and  to  testing  the  effect  of  treatments  in  an  analogy 
of analysis of variance. The latter problem has also been
investigated by Welch (67, 68).




To test for treatment effects in analogy with analysis of
variance, Pitman takes m batches (letters) of n individuals, each of
which is subjected to a different one of n treatments (numbers), the
assignment of individuals to treatments being random. The scores of the
individuals can be represented as follows:

    a_1, a_2, ..., a_n
    b_1, b_2, ..., b_n
    ...
    m_1, m_2, ..., m_n

If there is no treatment effect, the n scores in each row are randomly
placed in the n "treatment" columns. There are n! ways in which
the observations in a row can be permuted and since there are m rows,
there are $(n!)^m$ tables which can be obtained by permuting the
observations within rows. However, some of these tables differ only in
the permutation of identical columns. This can be prevented by
permitting permutation of observations in all but the last row.
Therefore, there are $(n!)^{m-1}$ ways in which the mn observations can
be assigned so that each column contains one observation from each batch
and so that no two assignments are identical except for the location of
columns with respect to each other. Pitman calculates the equivalent
of the F ratio for each such assignment and rejects the hypothesis of
no treatment effect at the significance level
$\alpha = N/(n!)^{m-1}$ if the F ratio for the actually obtained sample
lies among the N most extreme of these.
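
The counting argument above can be sketched in Python as follows. This
is not Pitman's own computation: as a stand-in for the F ratio it uses
the sum of squared column (treatment) totals, on the assumption that,
since permuting scores within rows leaves the total and between-row
sums of squares unchanged, the F ratio increases whenever this sum
increases. The function names and the 3 x 3 table of scores are
hypothetical.

```python
from itertools import permutations, product

def treatment_ss(table):
    """Sum of squared column totals (a monotone stand-in for F)."""
    n_cols = len(table[0])
    return sum(sum(row[j] for row in table) ** 2 for j in range(n_cols))

def pitman_block_test(table):
    """Return (count, total) for the within-row randomization test.

    Rows are batches, columns are treatments. Every row except the
    last is permuted, giving (n!)**(m-1) distinct assignments; count
    is the number whose statistic equals or exceeds the observed one.
    """
    *free_rows, last_row = table
    observed = treatment_ss(table)
    count = total = 0
    for perms in product(*(permutations(row) for row in free_rows)):
        candidate = [list(p) for p in perms] + [list(last_row)]
        total += 1
        if treatment_ss(candidate) >= observed:
            count += 1
    return count, total

# Hypothetical scores: 3 batches (rows) under 3 treatments (columns).
scores = [[10, 14, 19],
          [11, 15, 21],
          [9, 16, 20]]
count, total = pitman_block_test(scores)
print(count, total)  # 1 36
```

In this made-up table every batch orders the treatments the same way,
so the observed assignment is the single most extreme of the
(3!)^2 = 36, and the significance level is 1/36.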


h. Tables. None. Probabilities must be calculated for
each  specific  case. 


i. Sources. 4, 7, 16, 26, 27, 28, 34, 39, 40, 41, 42, 43,
48, 67, 68, 75. See also 17 and 38 under 1, Fisher's Method of
Randomization: Matched Pairs.


5. The Wilcoxon Test: Unmatched Data


a. Rationale. Wilcoxon has modified Fisher's method by
replacing the obtained scores with their ranks. The test statistic,
which in Fisher's method was the difference in sample means, is,
in Wilcoxon's test, the rank sum for the smaller sample, or when
samples are of equal size, the smaller of the two sample rank sums.
The Wilcoxon modification has advantages similar to those discussed
in "Rationale" of 2, The Wilcoxon Test: Matched Pairs; the test is
not  a  conditional  one  since  the  sample  space  for  the  test  statistic  is 
the  same  for  every  pair  of  samples,  the  test  is  less  sensitive  to  ex¬ 
treme  observations,  and  the  probabilities  can  be  tabled. 

b. Null Hypothesis. Each of the $\binom{m+n}{n}$ pairs of "artificial"
samples obtainable by arbitrarily assigning m observations to one
treatment, n to the other, is equally likely to have been drawn as a
pair of true samples. If all assumptions are met, a sufficient
condition for the validity of the null hypothesis is that the two samples
come from identical populations. This will be the case if the two
treatments do not differ in any measured respect.

c. Assumptions. It is assumed that sampling is random,
observations are independent, no observations are tied or populations
are continuously distributed, and populations have identical forms (or at
least have identical forms if population means or medians are equal).
For reasons, see 4, Fisher's Method of Randomization: Unmatched Data.

d. Treatment of Ties. If ties are due only to imprecision
of measurement, i.e., if the thing measured is continuously
distributed, then ties are a problem only when members of a tied group
lie in both obtained samples. When all the observations having a given
tied value lie in one sample, they may be arbitrarily assigned the
ranks they would have if distinguishable. If observations in both
samples have the same value, one technique is to assign tied
observations the tied-for ranks least conducive to rejection of the null
hypothesis. Another technique is randomly to assign to the members
of the tied group the ranks they would have if distinguishable. This
preserves the mathematical integrity of the test, but forcibly and
artificially introduces an element of chance which must, in general,
reduce the power of the test.
The most frequently recommended technique is to give each
of the members of a tied group the midrank of the group, i.e., the
average of the ranks the tied members would have if their values were
distinguishable. The result is that the set of ranks obtained in this
manner and rank sums obtained by applying Fisher's method to them
are not the same as the set of ranks and rank sums used (by applying
Fisher's method to the m+n different integers from 1 to m+n) to
calculate the probability tables. The tables therefore are inaccurate in
such cases, giving not the true probability but rather the probability of
the average value taken by the test statistic when ties are broken in
all possible ways. (If all the observations having the same value lie
in the same sample, all ways of breaking ties result in the same value
for the test statistic and the tables are fully applicable if discontinuity
is due only to imprecision of measurement.)

When  midranks  are  used  the  rank  sum  may  not  be  an  integer. 
The tabled rank sums, however, are integers. Therefore, it is
suggested that when the obtained rank sum is not an integer it should be
raised or lowered one half unit so as to assume whichever integral
value is least conducive to rejection of the null hypothesis. This
procedure results in a slightly more conservative test.

In many cases the effect of using midranks is very much the
same as if tied observations were assigned consecutive ranks with
the ranks carefully apportioned so as to "balance" the apportionment
between the two samples. For example, suppose ten observations
are tied for 21st to 30th place in rank and two of the observations
are in sample A, the remainder in sample B. In "balancing" one
might assign the ranks 24 and 27 to the two observations in sample
A because they separate the ranks 21 to 30 into nearly equal parts,
or 21 and 30, 25 and 26, or any other assignment resulting in a
"symmetrical" pattern might be picked. The result of course is
that in every case the average of these ranks, for each sample, is
the midrank, 25 1/2. Therefore, when "symmetrical rank patterns"
can be obtained without resorting to nonintegral ranks, the use of
midranks is equivalent to assigning to each member of a tied group
a different one of the ranks for which the group is tied and doing so
in such a way that each sample gets its "fair share" of rank
magnitude. If the rank sum is an integer the tables give the exact
probability under the assumption that one of the possible "equitable
apportionments" is the correct one. In the long run the average
difference between this probability and the true probability will tend
to be zero; however, in any specific experiment a discrepancy of
zero is quite unlikely. Therefore, for the particular experiment
under test the probability of false rejection of the null hypothesis
may be greater or less than that indicated by the tables. Regardless
of whether or not "symmetry" can be obtained with integers,
the limits of "tie-error" can easily be found. This is accomplished
by assigning tied observations the "tied-for" ranks least conducive
to rejection of the null hypothesis, performing the conservative test,
then assigning them the ranks most conducive to rejection and
performing the radical test, thereby obtaining bounds for the influence
of ties on probability levels. This procedure has been recommended by
van der Vaart (58) who observes that if the chosen significance level
does not lie between these bounds there is no problem and if it does,
there is no solution. He adds that precisely the same dilemma
arises when ties occur in the application of Student's t-test although
"this fact has always passed unnoticed."
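
The midrank technique and the limits of "tie-error" can be illustrated
with a short Python sketch. It returns, for one sample, the rank sum
under midranks together with the smallest and largest rank sums
obtainable by breaking ties in that sample's disfavor or favor; each
bound could then be referred to the tables to give the conservative
and radical probabilities. The function name and the data are
hypothetical.

```python
def rank_sum_with_ties(sample_a, sample_b):
    """Return (midrank_sum, lowest_sum, highest_sum) for sample_a.

    lowest_sum gives sample_a the lowest ranks within every tied
    group; highest_sum gives it the highest ranks within every group.
    """
    pooled = sorted(sample_a + sample_b)
    mid = 0.0
    low = high = 0
    for value in set(sample_a):
        k_a = sample_a.count(value)      # tied members lying in sample A
        t = pooled.count(value)          # size of the whole tied group
        first = pooled.index(value) + 1  # lowest rank the group occupies
        last = first + t - 1             # highest rank the group occupies
        mid += k_a * (first + last) / 2  # each member gets the midrank
        low += sum(range(first, first + k_a))
        high += sum(range(last - k_a + 1, last + 1))
    return mid, low, high

# Hypothetical samples with ties that straddle both samples.
a = [3, 5, 5, 9]
b = [1, 5, 7, 9, 10]
print(rank_sum_with_ties(a, b))  # (17.5, 16, 19)
```

For the example in the text, a group tied for ranks 21 to 30 has
midrank (21 + 30)/2 = 25 1/2, which is what the expression
(first + last) / 2 computes.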

When  samples  are  so  large  that  tables  are  inapplicable  the 
normal  approximation  is  generally  used.  The  difference  between  the 
obtained  and  the  expected  rank  sum  is  divided  by  the  standard  devia¬ 
tion  of  the  rank  sum,  and  the  resulting  critical  ratio  is  treated  as  a 
normal  deviate  with  zero  mean  and  unit  variance  and  referred  to  nor¬ 
mal  probability  tables.  When  ties  are  given  the  midrank,  the  pres¬ 
ence  of  ties  has  no  effect  upon  the  expected  rank  sum,  but  does 
affect  the  variance,  causing  it  to  be  smaller  than  would  be  the  case 
if there were no ties. There is a formula, however, which takes
account of ties in calculating variance and therefore "corrects" for
ties when used in calculating the critical ratio. This formula
requires that the Mann-Whitney form of the Wilcoxon test be used
(see 6, The Mann-Whitney Test).
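
A sketch of such a tie-corrected critical ratio, in Python, is given
below. The variance expression used,
mn/12 [(N+1) - sum(t^3 - t)/(N(N-1))] with N = m+n and t the size of
each tied group, is the commonly quoted correction and is an assumption
here, since the formula itself is not reproduced in this section. The
function name and data are hypothetical.

```python
import math
from collections import Counter

def mann_whitney_z(sample_a, sample_b):
    """Return (u, expected_u, variance, z) using midranks for ties.

    u is the Mann-Whitney statistic for sample_a, obtained from its
    midrank sum; variance includes the assumed correction for ties.
    """
    m, n = len(sample_a), len(sample_b)
    big_n = m + n
    pooled = sorted(sample_a + sample_b)
    # Assign each distinct value the average of the ranks it occupies.
    midrank = {}
    i = 0
    while i < big_n:
        j = i
        while j < big_n and pooled[j] == pooled[i]:
            j += 1
        midrank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w = sum(midrank[x] for x in sample_a)     # rank sum for sample A
    u = w - m * (m + 1) / 2
    expected_u = m * n / 2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = m * n / 12 * ((big_n + 1) - tie_term / (big_n * (big_n - 1)))
    z = (u - expected_u) / math.sqrt(variance)
    return u, expected_u, variance, z

u, expected_u, variance, z = mann_whitney_z([3, 5, 5, 9], [1, 5, 7, 9, 10])
print(u, expected_u, round(variance, 3), round(z, 3))
```

With no ties the tie term vanishes and the variance reduces to
mn(m+n+1)/12; with ties it is smaller, as stated above. Samples this
small would of course be referred to the tables rather than to the
normal approximation; they are used here only to keep the arithmetic
visible.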

e. Efficiency. The value 3/π or .955 has been obtained for the
asymptotic efficiency of the Wilcoxon test relative to Student's
t-test when both tests are applied to samples from normally
distributed populations with homogeneous variances. This value has
been obtained by a number of authors (9, 11, 37, 50, 59, 62), Pitman
(not referenced) apparently having been the first, and is true of both
one-sided and two-sided tests under several different definitions of
asymptotic efficiency. Hodges and Lehmann (23) have shown that the
asymptotic relative efficiency of the two-sample Wilcoxon test
relative to Student's t cannot fall below .864 when both are used as
tests against shift of a continuous, but otherwise unspecified,
distribution function. (The comparison is less favorable to the
Wilcoxon test when shift is accompanied by "contaminations".) They
conclude that to the extent that the concept of asymptotic relative
efficiency "adequately represents what happens for the sample sizes
and alternatives arising in practice, this result shows that use of
the Wilcoxon test instead of Student's t-test can never entail a
serious loss of efficiency for testing against shift. (On the other
hand ... the Wilcoxon test may be infinitely more efficient than the
t-test.)"

In fact Pitman is quoted (23, 47) as having found an A.R.E. of 1
for Wilcoxon's relative to Student's test when both were applied to
uniform distributions. Pitman (23, 47) and Pitman and Noether (7)
are quoted as having found the A.R.E. of Wilcoxon's relative to
Student's test to be considerably greater than 1 when the two tests
were applied to certain types of distributions. Similar results have
also been found for small samples. Student's test has been found to
have power inferior to that of the Wilcoxon test for testing samples
of 4 and 6 observations from certain uniform distributions (63) and
for testing samples of 5 and 5 from certain distributions differing in
peakedness (Whitney quoted in 1).
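
The asymptotic figures quoted above can be checked numerically. The
sketch below assumes the usual large-sample expression for the
efficiency of the Wilcoxon test relative to Student's t against shift,
12 σ² (∫ f(x)² dx)², where f is the population density and σ² its
variance. That expression is not stated in this report, and the
function names are illustrative only.

```python
import math

def integrate(func, lo, hi, steps=20_000):
    """Composite trapezoidal rule on the interval [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h

def are_wilcoxon_vs_t(pdf, variance, lo, hi):
    """Assumed formula: 12 * variance * (integral of pdf squared) ** 2."""
    integral = integrate(lambda x: pdf(x) ** 2, lo, hi)
    return 12.0 * variance * integral ** 2

def normal_pdf(x):
    """Density of the unit normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def uniform_pdf(x):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

are_normal = are_wilcoxon_vs_t(normal_pdf, 1.0, -10.0, 10.0)
are_uniform = are_wilcoxon_vs_t(uniform_pdf, 1.0 / 12.0, 0.0, 1.0)

print(round(are_normal, 3), round(3 / math.pi, 3))  # both near .955
print(round(are_uniform, 3))                        # near 1
```

Under this assumed expression the normal case reproduces 3/π and the
uniform case reproduces the value of 1 attributed to Pitman.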

When both tests are applied to samples from normal populations
with homogeneous variances, Student's test has invariably been found
to have power as great as or greater than Wilcoxon's; however, the
difference in efficiency has, with one exception, always been very
slight (9, 23, 52, 59, 60, 62). The exception (9) has been criticized
(23) as attributable to a procedural artifact.

The evidence therefore supports the conclusion that Student's
t-test is statistically more efficient than Wilcoxon's test when the
assumptions of the t-test have been completely met, but that the
superiority of the t-test is slight, amounting to less than 5%. When
Student's assumptions have not been fully met, either test may be the
more powerful, depending upon a number of factors. However, if it is
known that the populations have identical, continuous forms when
their location parameters are equal (i.e., that if treatments have
different effects, these include effects upon means and medians), or
if the experimenter is interested in detecting any discrepancy between
continuously distributed populations (i.e., any type of treatment
effect), then the Wilcoxon test is preferable. Rejection of the null
hypothesis can occur only because of the existence of the effect in
which the experimenter is interested or because of chance with
probability of exactly α. If Student's test were used in the same
cases, rejection could occur because of (a) the effect whose detection
is desired, (b) nonnormality, (c) chance, with probability other than
α (and unknown) unless the populations are known to be normal.

The Wilcoxon test is one of the most powerful distribution-free
tests. Tests designed by Terry and van der Waerden, and discussed in
the Introduction and in (7), are slightly more efficient, in the
statistical sense, for certain test situations. However, they lack
the Wilcoxon test's conceptual simplicity and ease of application. In
several investigations of the power of distribution-free tests with
respect to each other, the Wilcoxon test has invariably been found
to be most powerful or among the most powerful (see Table II in
Introduction).

Mann and Whitney (35) showed that the Wilcoxon test is consistent
"with respect to the class of alternatives f(x) > g(x) for every x",
i.e., is consistent if the alternative to the null hypothesis of
identical populations is that the cumulative distribution of one
population lies entirely above, i.e., does not cross, that of the
other. Van Dantzig (6) and Lehmann (32) have pointed out that Mann and
Whitney's proof actually is more general. It proves the test
consistent if, when the null hypothesis is false, the probability that
a random observation from one population exceeds one from the other
population differs from 1/2 (for a two-tailed test or, for a
one-tailed test, differs from 1/2 in a specified direction) (30). The
above results require that the ratio m/n remain constant as n → ∞.
Putter (44) has shown that, under the same conditions, if the
populations are discontinuous and Pr(x > y) + 1/2 Pr(x = y) > 1/2 the
test will be consistent if ties are randomized, i.e., if ties in each
group of tied observations are randomly assigned the tied-for ranks.

Lehmann (32) has proved that the Wilcoxon test is unbiased when
it is used as a one-tailed test; more specifically, it is unbiased
for the class of alternatives F(x) > G(x) for every x. Van der Vaart
(55, 59) has shown that the two-tailed Wilcoxon test may be, but is
not necessarily, biased. The likelihood of such bias appears to be
greater when samples are of unequal size and when populations are
skewed.


Mann  and  Whitney  (35)  showed  that  their  mathematically 
equivalent  test  statistic  is  asymptotically  normally  distributed  under 
the  null  hypothesis  if  m  and  n  approach  infinity  in  any  arbitrary  man¬ 
ner.  Lehmann  (32)  has  found  that  it  is  also  asymptotically  normally 
distributed  when  the  populations  differ  provided  that  the  ratio  m/n 
remains  constant  as  m  and  n  approach  infinity.  Stoker  (49)  states 
that  Lehmann's  proof  also  applies  when  populations  are  discontinuous. 




Asymptotic  normality  has  also  been  proven  by  Haldane  and  Smith  (20). 


f. Application. Suppose that gain in weight has been measured
under two different diets with the following results for six individuals
subjected to Diet A and seven persons given Diet B.

          Diet A                     Diet B
   Weight Gain    Rank        Weight Gain    Rank
       -14        1               -3          5
       -12        2 1/2            5          8
       -12        2 1/2            7          9
       -10        4                8         10
        -2        6                9         11
         2        7               15         12
                                  24         13
       Sum       23               Sum        68

There are $\binom{13}{6}$ or 1716 ways of redistributing the scores
into samples of sizes 6 and 7. Of these, there are only four ways
in which Diet A could obtain a rank sum equal to or smaller than the
obtained rank sum of 23. They are as follows (only the ranks being
shown):

   1, 2 1/2, 2 1/2, 4, 5, 6      Σ = 21
   1, 2 1/2, 2 1/2, 4, 5, 7      Σ = 22
   1, 2 1/2, 2 1/2, 4, 5, 8      Σ = 23
   1, 2 1/2, 2 1/2, 4, 6, 7      Σ = 23

The significance level for a one-tailed test of the hypothesis that Diet A
causes the same or more weight gain than Diet B, therefore, is 4/1716
or about .0023.
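
The enumeration just described can be reproduced mechanically. The
following is a minimal sketch, not a procedure given in the report: it
assigns midranks to the thirteen weight gains, forms every division of
the pooled ranks into groups of 6 and 7, and counts the divisions whose
Diet A rank sum does not exceed the observed one. The helper name
`midranks` is illustrative.

```python
from fractions import Fraction
from itertools import combinations

diet_a = [-14, -12, -12, -10, -2, 2]
diet_b = [-3, 5, 7, 8, 9, 15, 24]

def midranks(values):
    """Map each distinct value to the midrank of its tied group."""
    ordered = sorted(values)
    ranks = {}
    for v in set(ordered):
        first = ordered.index(v) + 1   # lowest rank position held by v
        count = ordered.count(v)       # size of the tied group
        ranks[v] = Fraction(2 * first + count - 1, 2)
    return ranks

pooled = diet_a + diet_b
rank_of = midranks(pooled)
pooled_ranks = [rank_of[v] for v in pooled]
observed = sum(rank_of[v] for v in diet_a)

# Every way of choosing which 6 of the 13 observations belong to Diet A.
splits = list(combinations(range(len(pooled)), len(diet_a)))
as_extreme = sum(
    1 for idx in splits if sum(pooled_ranks[i] for i in idx) <= observed
)
p_one_tailed = Fraction(as_extreme, len(splits))

print(observed, len(splits), as_extreme, float(p_one_tailed))
```

Exact fractions are used for the midranks so that the comparison with
the observed rank sum is not disturbed by rounding.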

Since the samples are of unequal size, a two-tailed test raises
the question of which rank sums to consider as extreme in the opposite
direction. Obviously they cannot be those totaling to 68 or more for
Diet A, because that number was obtained for Diet B as the sum of
seven ranks, while for Diet A only six ranks can be summed. The
solution proposed by White (69) is to rerank the observations, this
time ranking the largest observation 1, the next largest 2, etc.;
then the ways of redistributing scores which give Diet A a rank sum
of 23 or smaller under this reversed ranking are those whose rank
sums are as extreme or more extreme in the "opposite" direction.
There are, in fact, four such ways and the probability level for a
two-tailed test is therefore 8/1716. However, the reranking need not
actually be performed because the test statistic is symmetrically
distributed and the probability level for a two-tailed test is simply
twice that for a one-tailed test.

In  practice,  of  course,  probabilities  would  generally  not 
be  obtained  by  applying  the  method  of  randomization,  but  would  be 
obtained  from  tables.  In  that  case,  only  the  rank  sums  need  be  ob¬ 
tained.  The  use  of  tables  varies  considerably,  however,  from  one 
table  to  another,  and  the  particulars  of  application  will  not  be  des¬ 
cribed  here. 

g. Discussion. Various forms of the Wilcoxon test have
been published by a variety of authors. Wilcoxon developed the test
for the case where samples are of equal size, i.e., m = n. White
(69) extended the test, and tabled it, to the case of unequal sample
sizes. This was also done by van der Reyden (45) at about the same
time, but apparently without knowledge of the work of either Wilcoxon
or White. A test, conducted differently, but mathematically
equivalent to the Wilcoxon test, was developed independently by
Festinger (15) and published very soon after Wilcoxon's original
article. Festinger took as his test statistic the absolute difference
between the average rank for the smaller sample and the average rank
for the combined sample of m + n ranks. Since the latter is a constant
(equal to $(m+n+1)/2$) for fixed values of m and n, and since the
average rank for the smaller sample is simply its rank sum divided by
its size, Festinger's test is mathematically equivalent to White's
extension of the Wilcoxon test. Because of the additional computation
required to obtain the test statistic, d, Festinger's test is more
time consuming than the Wilcoxon test. A Wilcoxon-like test was
developed by Haldane and Smith (20, see also 3 and 24) for a specific
application. Finally, a modified form of the Wilcoxon test developed
by Mann and Whitney (35) has become the most widely used form of the
test. It is discussed in the next section. Because of the mathematical
relationships existing between the Wilcoxon, White, van der Reyden,
Festinger and Mann-Whitney tests, they have common mathematical
properties of efficiency, consistency, asymptotic normality, etc.




The Wilcoxon test actually tests whether or not two populations
are identical. The test becomes a test for equal means (or
substitute "medians") if it can be legitimately assumed either (a)
that whatever their locations the populations have identical forms,
or (b) that if their means (or substitute "medians") are equal the
populations have identical forms, i.e., the populations are identical.
The latter assumption is generally far more realistic than the former;
however, the test may have less power if only the latter assumption
can be made. See "Assumptions" under Section 4, Fisher's Method
of Randomization: Unmatched Data.

Wilcoxon (71, 72, 73, 74) has extended his test to permit
a single test of data collected under several different, non-tested
experimental conditions. Under each of k non-tested conditions,
n observations are taken under treatment A and n observations under
treatment B. Then, except for the last step of determining
significance levels, the ordinary Wilcoxon test is performed for each
non-tested condition independently. This results in a rank sum, based
on n ranks, for treatment A, and one for treatment B, under each of
the k non-tested conditions. The sum of the k rank sums is then
obtained for each treatment and the smaller of these is referred to a
brief, specially prepared table of probabilities. The test is
legitimate (as a test for simple treatment effects) provided that when
the k non-tested conditions have different effects upon observations,
any given condition has the same effect upon observations taken under
one treatment as it has upon observations taken under the other. That
is to say, there must be no interaction between treatments and
non-tested conditions. If this implicit assumption is not met, the
power of the test may be adversely affected and when the null
hypothesis (that each of the k B-populations has the same form and
location as its A-population counterpart) is false, the true
alternative hypothesis can be specified only in very general terms.
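
The pooling of rank sums over non-tested conditions can be sketched as
follows. This is an illustration under stated assumptions, not
Wilcoxon's tabled procedure: the data are hypothetical, ties are
assumed absent, and instead of consulting the special table the sketch
builds the exact null distribution of the summed rank sum by combining
the k independent within-condition distributions. All function names
are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def rank_sum_within(a, b):
    """Rank sum of treatment A when a and b are ranked together (no ties)."""
    pooled = sorted(a + b)
    return sum(pooled.index(v) + 1 for v in a)

def null_counts(n, k):
    """Number of equally likely outcomes giving each total rank sum for A
    when n of the 2n ranks go to A under each of k independent conditions."""
    single = Counter(sum(c) for c in combinations(range(1, 2 * n + 1), n))
    total = Counter({0: 1})
    for _ in range(k):
        step = Counter()
        for s, ways_s in total.items():
            for t, ways_t in single.items():
                step[s + t] += ways_s * ways_t
        total = step
    return total

# Hypothetical data: k = 2 non-tested conditions, n = 2 observations per
# treatment under each condition, given as (treatment A, treatment B).
blocks = [([1.0, 2.0], [3.0, 4.0]), ([10.0, 11.0], [12.0, 13.0])]
n, k = 2, len(blocks)

observed = sum(rank_sum_within(a, b) for a, b in blocks)
counts = null_counts(n, k)
p_lower = Fraction(
    sum(w for s, w in counts.items() if s <= observed), sum(counts.values())
)
print(observed, p_lower)
```

Ranking is done separately within each condition, so differences
between conditions do not enter the statistic; only the no-interaction
requirement discussed above remains.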

h. Tables. Tables can be found in 45, 69, 70, 72, 73 for
the Wilcoxon or rank sum form of the test, in 15 for Festinger's
difference-in-average-rank form, and in 1, 35, 46 (and see also 18
and 36) for the Mann-Whitney form of the test. Tables for Wilcoxon's
application of his test to data collected under a variety of non-tested
conditions are in 71, 72 and 73. Tables can also be found, reproduced,
in some of the sources listed in the Introduction.

Several of these tables have been found to contain errors.
Auble's tables have been criticized by Fix and Hodges (18), Festinger's
tables by Kruskal and Wallis (30), van der Reyden's tables by Kruskal
and Wallis (31), and White's tables by Fix and Hodges (18) and Kruskal
and Wallis (31).


For cases not covered by existing tables, probabilities may
be obtained by the method of randomization, or the rank sum may be
treated as normally distributed and approximate probabilities may be
obtained by referring a critical ratio to normal tables. Let T be the
rank sum for the sample with m observations. Then, if the null
hypothesis is true, T comes from a population of rank sums whose
mean $\bar{T}$ is $m\left(\frac{m+n+1}{2}\right)$ and whose variance
$\sigma_T^2$ is $mn\left(\frac{m+n+1}{12}\right)$. As m and n increase,
the distribution of T approaches the normal distribution. Therefore,
the approximate probability level for T can be obtained by referring
the critical ratio

$$\frac{T - m\left(\frac{m+n+1}{2}\right)}{\sqrt{mn\left(\frac{m+n+1}{12}\right)}}$$

to normal probability tables. The approximation is reasonably good,
when m and n are large, except at the extreme tails of the normal
distribution. Therefore extreme levels of significance, such as the
.001, should not be adopted when the normal approximation is used.
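
As an arithmetic illustration of the critical ratio, the sketch below
evaluates it for the diet example of the Application section (T = 23,
m = 6, n = 7). Samples this small would ordinarily be referred to exact
tables; the numbers are reused only because they are already at hand.
No correction for ties or for continuity is applied, and the function
names are illustrative.

```python
import math

def wilcoxon_critical_ratio(t, m, n):
    """(T - mean) / standard deviation under the null hypothesis, no ties."""
    mean = m * (m + n + 1) / 2
    variance = m * n * (m + n + 1) / 12
    return (t - mean) / math.sqrt(variance)

def normal_lower_tail(z):
    """Probability that a unit normal deviate falls at or below z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

z = wilcoxon_critical_ratio(23, 6, 7)   # mean 42, variance 49
p_approx = normal_lower_tail(z)

print(round(z, 3), round(p_approx, 4))
```

The approximate one-tailed probability comes out somewhat above the
exact 4/1716, which is consistent with the caution about the extreme
tails.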


If T is the rank sum for the sample with m observations when
the smallest rank is assigned to the smallest observation, and T' is
the rank sum for the same sample when the smallest rank is assigned
to the largest observation, then $T' = m(m+n+1) - T$. This is easily
seen: if r is the rank of one of the m observations in the first case
and r' is the corresponding rank in the second case, then
$r' = m + n - (r - 1) = m + n + 1 - r$. And since $T' = \sum r'$, then
$T' = \sum (m+n+1) - \sum r = m(m+n+1) - \sum r = m(m+n+1) - T$.
This formula saves the labor of reranking when tables, such as
White's, require the smaller of the two T values.
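
The identity T' = m(m + n + 1) - T can be verified on the diet data by
actually carrying out the reranking it is meant to avoid. This is a
small check, not part of the original procedure, and the helper
`midrank_sum` is an illustrative name.

```python
def midrank_sum(sample, pooled, descending=False):
    """Sum of the midranks held by `sample` within `pooled`."""
    ordered = sorted(pooled, reverse=descending)
    total = 0.0
    for v in sample:
        first = ordered.index(v) + 1   # first rank position of the tied group
        count = ordered.count(v)       # size of the tied group
        total += first + (count - 1) / 2
    return total

diet_a = [-14, -12, -12, -10, -2, 2]
diet_b = [-3, 5, 7, 8, 9, 15, 24]
pooled = diet_a + diet_b
m, n = len(diet_a), len(diet_b)

t = midrank_sum(diet_a, pooled)                          # smallest ranked 1
t_reversed = midrank_sum(diet_a, pooled, descending=True)  # largest ranked 1

print(t, t_reversed, m * (m + n + 1) - t)
```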


i. Sources: 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19,
20, [illegible], 23, 24, 25, 29, 30, 31, 32, 33, 35, 36, 37, 44, 45,
46, 47, 49, 50, 51, 52, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 69,
70, 71, 72, 73, 74.


6. The Mann Whitney Test

a. Rationale. Let a sample of n observations, designated
as Xs, and a sample of m observations, identified as Ys, be taken
from the same continuously distributed population. Now arrange
the m + n observations in order of increasing size irrespective of
sample. Then replace each ordered observation with an X or a Y
depending on the sample from which it originally came. The result
will be a pattern of n Xs and m Ys intermixed.


If these m + n units were all different, there would be (m + n)!
distinguishable patterns. However, for each actually distinguishable
pattern there are n! permutations of Xs with each other which do not
change the pattern, and for each of these permutations there are m!
permutations of Ys with each other which do not change the pattern.
Therefore, there are $\frac{(m+n)!}{m!\,n!}$ or $\binom{m+n}{m}$
distinguishable patterns of n Xs and m Ys. If the two samples are
drawn from the same population each of these patterns is equally
likely. However, if they come from different populations, the patterns
should be unequally likely, and if the populations differ in location
only, one would anticipate patterns in which Xs tended to cluster at
one end, Ys at the other.


The test statistic, U, therefore is the number of times a
Y precedes an X. Thus, U is the number of Ys preceding the smallest
X, plus the number of Ys preceding the next smallest X (and therefore
including all of the Ys counted in the first case), etc., until the
number of Ys preceding each X has been counted and summed for all Xs.
The probability of U, when the null hypothesis is true, is simply the
proportion of the $\binom{m+n}{m}$ possible patterns which result in
Us as extreme or more extreme than that obtained.
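
The counting rule for U can be stated compactly in code. The sketch
below is illustrative only: the observations are hypothetical, ties
are assumed absent, and U is obtained two ways, from the raw
observations and from the pattern of Xs and Ys, to show that they
agree.

```python
def u_from_observations(xs, ys):
    """Number of (X, Y) pairs in which the Y precedes (is less than) the X."""
    return sum(1 for x in xs for y in ys if y < x)

def u_from_pattern(pattern):
    """Sum, over each X in the ordered pattern, of the Ys standing before it."""
    ys_seen = 0
    u = 0
    for letter in pattern:
        if letter == "Y":
            ys_seen += 1
        else:
            u += ys_seen
    return u

# Hypothetical observations, n = 3 Xs and m = 2 Ys.
xs = [1.2, 3.4, 5.6]
ys = [2.3, 4.5]

ordered = sorted([(v, "X") for v in xs] + [(v, "Y") for v in ys])
pattern = "".join(label for _, label in ordered)

print(pattern, u_from_observations(xs, ys), u_from_pattern(pattern))
```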

b. Null Hypothesis. Each of the $\binom{m+n}{m}$ patterns of Xs and
Ys, representing their observations arranged in order of increasing
algebraic magnitude, is equally likely. A sufficient condition for the
validity of the null hypothesis is that the two samples were drawn
randomly and independently from identical continuously distributed
populations.

c.  Assumptions:  See  5,  The  Wilcoxon  Test:  Unmatched  Data. 

d. Treatment of Ties. If Xs are tied with Ys, and m and
n are small enough for the tables to apply, it is suggested that the
Wilcoxon form of the test be used and that ties be treated as outlined
in 5, The Wilcoxon Test: Unmatched Data.

If m and n are large enough to justify using the normal
approximation to the distribution of U, a correction for ties can be
applied in calculating the critical ratio. See "Tables".

e. Efficiency. See 5, The Wilcoxon Test: Unmatched Data.
Efficiency, power, consistency and bias are the same as for the
Wilcoxon test for unmatched data.

f. Application. Let the observations from the example of
application of the Wilcoxon test be arranged in order of increasing
magnitude, with the letter in parentheses indicating the sample from
which an observation came. The result is -14(A), -12(A), -12(A),
-10(A), -3(B), -2(A), 2(A), 5(B), 7(B), 8(B), 9(B), 15(B), 24(B). The
number of times a B precedes an A is 2. A value of U as small or
smaller than this could be obtained from the following arrangements:

   A A A A A A B B B B B B B      U = 0
   A A A A A B A B B B B B B      U = 1
   A A A A A B B A B B B B B      U = 2
   A A A A B A A B B B B B B      U = 2

Since there are $\binom{6+7}{6}$ or 1716 possible arrangements, the
significance level for a one-tailed test of the hypothesis that the
A's either equal or exceed the B's is 4/1716. For a two-tailed test,
the mirror images of the four patterns shown above must be considered
as causing large U's which are correspondingly "as extreme". These are
the patterns in which a B follows an A zero, 1, 2, and 2 times, or, to
return to the definition of U, the ways in which a B precedes an A 42,
41, 40 and 40 times. Since there are eight values of U as extreme as
that obtained, the values being 0, 1, 2, 2, 40, 40, 41, 42, the
significance level for a two-tailed test is 8/1716.
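
The count of arrangements can be confirmed by listing all of them. The
sketch below is a plain enumeration, not a method given in the report:
each arrangement is identified by the positions occupied by the six
A's among the thirteen places, and U is accumulated as the number of
B's standing before each A.

```python
from itertools import combinations

NUM_A, NUM_B = 6, 7
TOTAL = NUM_A + NUM_B

def u_for_positions(a_positions):
    """U for an arrangement given the sorted 0-based positions of the A's.

    The j-th A (0-based) at position p has exactly p - j B's before it.
    """
    return sum(p - j for j, p in enumerate(a_positions))

counts = {}
for a_positions in combinations(range(TOTAL), NUM_A):
    u = u_for_positions(a_positions)
    counts[u] = counts.get(u, 0) + 1

arrangements = sum(counts.values())
lower_tail = sum(c for u, c in counts.items() if u <= 2)
upper_tail = sum(c for u, c in counts.items() if u >= NUM_A * NUM_B - 2)

print(arrangements, lower_tail, upper_tail)
```

The two tails contain the same number of arrangements, which is the
symmetry relied upon when the one-tailed level is doubled.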


g. Discussion. Let there be n xs and m ys arranged in
order of increasing magnitude. Let $x_i$ be the $i$th x in order of
increasing magnitude and the $r_i$th measurement, i.e., the $r_i$th
among the xs and ys combined, in order of increasing magnitude, and
let $u_i$ be the number of ys preceding $x_i$. Finally let T be the
Wilcoxon rank sum of the x ranks and let U be the Mann-Whitney
statistic, the number of times a y precedes an x. Then $r_i$ is the
Wilcoxon rank of $x_i$ and $r_i = i + u_i$. And

$$T = \sum_{i=1}^{n} r_i = \sum_{i=1}^{n} (i + u_i) = n\left(\frac{n+1}{2}\right) + \sum_{i=1}^{n} u_i = n\left(\frac{n+1}{2}\right) + U.$$

The sum of all ranks is simply the number of ranks times the average
rank, or $(m+n)\left(\frac{m+n+1}{2}\right)$; therefore, T', the rank
sum of the y ranks, is $(m+n)\left(\frac{m+n+1}{2}\right) - T$. So
$T' = (m+n)\left(\frac{m+n+1}{2}\right) - n\left(\frac{n+1}{2}\right) - U$,
which reduces to $T' = mn + \frac{m(m+1)}{2} - U$.


Thus  the  Mann-Whitney  test  statistic  U,  for  any  given  values 
of  m  and  n,  differs  from  the  Wilcoxon  test  statistic,  T,  only  by  a 
constant.  Otherwise  stated,  the  two  statistics  are  mathematically 
equivalent.  The  formulas  relating  T  to  U  may  be  useful  in  saving 
labor  when  tables  are  in  terms  of  U,  since  it  is  generally  easier  to 
obtain  T  than  U  (which  involves  an  excessive  amount  of  counting ).  The 
Mann-Whitney  statistic  is  also  related  to  Kendall's  S  for  rank  corre¬ 
lation. 
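
The relations between T and U can be checked on the diet data, taking
Diet A as the x sample (n = 6) and Diet B as the y sample (m = 7). In
this sketch the only tie lies within Diet A, so counting the y's that
precede each x is unaffected by it; the helper names are illustrative
and the check is not part of the original text.

```python
def midrank_sum(sample, pooled):
    """Sum of the midranks held by `sample` within `pooled`."""
    ordered = sorted(pooled)
    total = 0.0
    for v in sample:
        first = ordered.index(v) + 1
        count = ordered.count(v)
        total += first + (count - 1) / 2
    return total

xs = [-14, -12, -12, -10, -2, 2]   # Diet A
ys = [-3, 5, 7, 8, 9, 15, 24]      # Diet B
n, m = len(xs), len(ys)
pooled = xs + ys

t_x = midrank_sum(xs, pooled)      # Wilcoxon rank sum of the x ranks
t_y = midrank_sum(ys, pooled)      # rank sum of the y ranks
u = sum(1 for x in xs for y in ys if y < x)   # times a y precedes an x

u_from_t = t_x - n * (n + 1) / 2
t_y_from_u = m * n + m * (m + 1) / 2 - u

print(u, u_from_t, t_y, t_y_from_u)
```

Because U is recovered from T by subtracting a constant, a table in
terms of U can be entered after ranking alone.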


Many  of  the  points  discussed  in  connection  with  Fisher's 
Method  of  Randomization  and  the  Wilcoxon  test  are  also  relevant  to 
the Mann-Whitney statistic. They will not be recapitulated here; see
the "Discussion" sections of those tests.

h.  Tables.  1,  35,  46  (See  also  18  and  36).  Tables  can  also 
be  found,  reproduced,  in  some  of  the  sources  listed  in  the  Introduction. 

The number of ys which either precede or follow a given
x is m, the size of the y sample; and since there are n xs, the
number of times a y either precedes or follows an x is nm. Therefore
if U is the number of times a y precedes an x, then mn - U is the
number of times a y follows an x. This, however, is also the
number of times an x precedes a y. Therefore, the count need
be made only once even though most tables list only the smaller of
the two values U and U' = mn - U.

When m and n are too large for the exact tables to apply,
approximate probabilities may be obtained by referring a critical
ratio to normal tables. If there are no ties, the test statistic U
comes from a population of U's whose mean is $\bar{U} = \frac{mn}{2}$
and whose variance is $\sigma_U^2 = \frac{mn(m+n+1)}{12}$. Ties do not
affect the mean, but they decrease the variance (22). Let $t_i$ be the
number of tied observations in the $i$th group of tied observations,
and let there be k groups. Then when there are ties the variance
becomes

$$\sigma_U^2 = \frac{mn}{12}\left[m + n + 1 - \frac{\sum_{i=1}^{k}(t_i^3 - t_i)}{(m+n)(m+n-1)}\right].$$

Probabilities may therefore be obtained by referring the critical ratio

$$\frac{U - \bar{U}}{\sigma_U}$$

to normal tables.
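
The effect of the correction for ties can be seen numerically. The
sketch below assumes the corrected variance has the usual form
(mn/12)[m + n + 1 - Σ(t³ - t)/((m + n)(m + n - 1))], with the sum
taken over the groups of tied observations, and applies it to the diet
data purely for illustration (samples this small would normally be
referred to exact tables). Function names are illustrative.

```python
import math
from collections import Counter

def u_variance(m, n, pooled=None):
    """Null variance of U; corrected for ties if the pooled data are given."""
    correction = 0.0
    if pooled is not None:
        total = m + n
        tie_sizes = Counter(pooled).values()   # untied values contribute 0
        correction = sum(t ** 3 - t for t in tie_sizes) / (total * (total - 1))
    return m * n / 12 * (m + n + 1 - correction)

def critical_ratio(u, m, n, pooled=None):
    """(U - mn/2) divided by the standard deviation of U."""
    return (u - m * n / 2) / math.sqrt(u_variance(m, n, pooled))

diet_a = [-14, -12, -12, -10, -2, 2]
diet_b = [-3, 5, 7, 8, 9, 15, 24]
pooled = diet_a + diet_b
m, n = len(diet_b), len(diet_a)
u = sum(1 for a in diet_a for b in diet_b if b < a)   # 2

var_untied = u_variance(m, n)
var_tied = u_variance(m, n, pooled)
print(var_untied, round(var_tied, 4))
print(round(critical_ratio(u, m, n), 4), round(critical_ratio(u, m, n, pooled), 4))
```

With a single pair of tied observations the variance falls only
slightly, so the critical ratio is scarcely changed.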


i. Sources. See 5, The Wilcoxon Test: Unmatched Data.


CHAPTER VI


TESTS  BASED  ON  THE  METHOD  OF  RANDOMIZATION  II 


Fisher's method can be applied to almost any type of statistic
or sample information. In the present chapter it is extended to
testing for correlation, the most significant such application being
that in which the method is used to obtain exact tables for Spearman's
rank difference correlation coefficient.


1. Pitman's Correlation Test


a. Rationale. Suppose that an x observation and a y observation
have been made on each of n units or individuals and that Pearson's
product moment correlation coefficient, r, has been calculated from
the data in the usual way. Now suppose that the correlation coefficient
is calculated for every possible set of paired xs and ys, using the same
data but permitting any given x observation to be paired with any of the
n y observations, not just the one recorded for the same unit. There
are n ways of assigning a y to $x_1$, n - 1 ways of assigning a y to
$x_2$ after making the first assignment, etc., so that there are in
all n! ways of re-pairing the xs and ys. Let N be the number of these
ways which result in a correlation coefficient as large or larger than
that obtained for the data as recorded. If there is no correlation
between x and y in the sampled population, then each of the n!
correlation coefficients is equally likely and the a priori probability
of obtaining a correlation coefficient as great or greater than that
actually obtained is N/n!.

b.  Null  Hypothesis.  Each  of  the  n!  sets  of  pairs  of  xs  and 
ys  is  equally  likely  to  have  been  recorded.  This  will  be  the  case  if 
all  assumptions  are  true  and  if  there  is  no  correlation  between  x 
and  y. 


c. Assumptions. Sampling is random, pairs of observations
are independent and the sampled populations are continuously distributed
so that there are no tied observations.

d. Treatment of Ties. If any xs or ys are tied there will be
fewer than n! distinguishable sets of pairs. However, if ties are due
to imprecision of measurement, the tied observations may be treated
as if distinguishable, by regarding one tied observation as "green",
another as "yellow", a third as "red", etc., in permuting data, so
that n! remains the proper denominator for the probability fraction.
To minimize error, half of the sets of pairs which, because of ties,
yield exactly the same r as the actually recorded data may be counted
as among the N "as extreme or more extreme" sets. For a conservative
test, N should include all of them.

e. Efficiency. No information available.

f. Application. Let the obtained data be represented as follows:

                                       Sum     Mean
   x       1     2     5     8    14    30       6
   y       2     1     7    10    15    35       7
   xy      2     2    35    80   210    Σxy = 329

The expression for r is

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$

whose numerator is
$\sum xy - \bar{x}\sum y - \bar{y}\sum x + \sum \bar{x}\bar{y}$ or
$\sum xy - n\bar{x}\bar{y} - n\bar{y}\bar{x} + n\bar{x}\bar{y}$ or simply
$\sum xy - n\bar{x}\bar{y}$ and whose denominator remains constant for
every set of re-paired xs and ys. The N "most extreme" rs therefore
will be those which have the N "most extreme" numerators. The numerator
for the observed data is 329 - 210 or 119. This value can be exceeded
in only one way: by switching the two leftmost ys. Therefore for a
one-tailed test of the null hypothesis that there is either zero or
negative correlation, α = 2/5! = 2/120. For a two-tailed test N must
include those sets of pairs for which the numerator of r is -119 or
less, i.e., those sets for which Σxy ≤ 210 - 119 = 91. In this
particular case there are no such sets, so the significance level
for a two-tailed test is still α = 2/120.

g. Discussion. In common with other tests based on Fisher's
Method of Randomization and using original, continuously distributed
measurements, this test is a conditional one. Strictly speaking
statistical inference can be extended only to a "population"
consisting of the xs and ys actually recorded, not to the larger
population from which they were drawn. To the extent that the obtained
sample is representative or typical of the larger population, it would
be legitimate to extend inference to the larger population. However,
such representativeness is not tested by the test and remains an
unproven assumption for which there is generally little or no
evidence. Likewise, because the rejection region varies with the
sample, it is impossible to construct generally useful tables of
probabilities for the test.


h.  Tables.  There  are  no  tables;  probabilities  must  be 
calculated  for  each  individual  case. 
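
Since probabilities must be worked out case by case, the calculation
lends itself to direct enumeration. The following sketch, which is not
part of the original text, re-pairs the five ys of the example in
every possible way and counts the re-pairings whose Σxy is at least as
extreme as that observed; because the denominator of r is constant, Σxy
alone decides the ordering. Variable names are illustrative.

```python
from itertools import permutations

x = [1, 2, 5, 8, 14]
y = [2, 1, 7, 10, 15]
n = len(x)

observed = sum(a * b for a, b in zip(x, y))   # sum of xy as recorded
center = sum(x) * sum(y) / n                  # n * mean(x) * mean(y)

# Sum of xy for each of the n! ways of re-pairing the ys with the xs.
sums = [sum(a * b for a, b in zip(x, perm)) for perm in permutations(y)]

upper = sum(1 for s in sums if s >= observed)
lower = sum(1 for s in sums if s - center <= -(observed - center))

print(observed, center, len(sums), upper, lower)
print("one-tailed:", upper, "/", len(sums))
print("two-tailed:", upper + lower, "/", len(sums))
```

For larger n the number of re-pairings grows as n!, so complete
enumeration of this kind is practical only for small samples.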

i. Sources. 23. See also 1.


2.  The  Rank  Difference  Correlation  Coefficient 


a. Rationale. If an x measurement and a y measurement
have been taken on each of n units or individuals, the Pearson product
moment correlation coefficient is

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}.$$

However, if the measurements are continuously distributed so that
there are no ties, and if each measurement is replaced by its rank
among measurements of the same type, the formula for Pearson's r
reduces to

$$r = 1 - \frac{6 \sum d^2}{n^3 - n},$$

where d is the difference between ranks of measurements taken on the
same unit (13). The latter formula is the expression for Spearman's
rank difference correlation coefficient, ρ. Therefore if original
measurements are replaced by their ranks, Pitman's test, applying
Fisher's Method of Randomization to the product moment correlation
coefficient, and the application of Fisher's Method of Randomization
to Spearman's rank difference correlation coefficient are
mathematically equivalent. By using ranks, however, instead of
original measurements, the test is no longer conditional upon the
particular measurements recorded, the sample space and the rejection
region for the test statistic are the same from one test to another
for the same sample size n, and significance levels may profitably be
tabled.
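The reduction described above can be checked numerically. The following is a minimal sketch, assuming untied measurements; the function names and the sample data are illustrative only and do not come from the sources cited.

```python
from math import sqrt


def ranks(values):
    """Rank untied values from 1 (smallest) to n (largest)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def pearson_r(x, y):
    """Pearson product moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Rank difference formula: 1 - 6 * sum(d^2) / (n^3 - n)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n ** 3 - n)


# Illustrative untied data (hypothetical values).
x = [3.1, 0.4, 7.7, 5.0, 9.2, 2.6]
y = [2.0, 1.5, 6.1, 8.3, 7.9, 0.2]

r_on_ranks = pearson_r(ranks(x), ranks(y))
rho = spearman_rho(x, y)
```

Under the no-ties assumption the two quantities agree to within floating point error; with tied midranks the shortcut formula no longer reproduces Pearson's r on the ranks.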

Therefore, let each x measurement be replaced by its rank
among the xs, and each y measurement by its rank among the ys.
There are n! ways of obtaining a sample of n pairs of ranks, each
pair containing an x rank and a y rank. If there is no correlation
between x and y, each of these n! samples was equally likely, on an
a priori basis, to have been the obtained sample. Therefore, if N
of these n! samples yield a rank difference correlation coefficient
as extreme or more extreme than that calculated for the actually
obtained sample, the probability associated with the obtained sample
is N/n!.


b. Null Hypothesis. Each of the n! sets of pairs of xs and
ys is equally likely to have been recorded. This will be the case if
all assumptions are true and if there is no correlation between x
and y.


c. Assumptions. Sampling is random, pairs of observations
are independent, and the sampled populations are either continuously
distributed or are natural rank populations consisting of the unrepeated
integers from 1 to n so that there are no tied ranks.


d. Treatment of Ties. When the same value is recorded
for more than one x observation or for more than one y observation,
the problem of ties is raised. It has generally been recommended
that such ties be given the midrank for the tied group in which they
appear. However, Thornton (31) has pointed out that when n "is very
small one or more pairs of tie rankings will change very greatly the
frequencies with which various values of Σd² and ρ can be obtained"
and has questioned "whether tie rankings tend to increase the
probability of positive coefficients and to decrease the probability
of negative coefficients." A perfect positive correlation, +1, is
obtained when Σd² = 0. For an n of 3, this can occur in the following
ways if ties are assigned the midrank.

      x     1  2  3     1 1/2  1 1/2  3     2  2  2     1  2 1/2  2 1/2

      y     1  2  3     1 1/2  1 1/2  3     2  2  2     1  2 1/2  2 1/2

A perfect negative correlation, -1, is obtained when

$$\sum d^2 = \frac{n^3 - n}{3}.$$

For an n of 3, the required sum of squared differences, 8, can occur
only for the case of no ties:

      x     1   2   3

      y     3   2   1

      d    -2   0   2

      d²    4   0   4

If any two of the three xs or of the three ys are tied, the corresponding
d²s will sum to less than 4 and the total sum of d²s will be less than 8.


Such considerations suggest that the most reasonable treatment
of ties is to distribute the tied-for ranks among the tied
observations in each group in that way which is least conducive to
rejection of the null hypothesis. The limits of "tie error" can be
obtained by calculating probabilities under both the above method and
the method by which tied-for ranks are assigned to tied observations
in the way most conducive to rejection.
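The limits of tie error described above might be obtained, for small n, by resolving the ties in every possible way and recording the smallest and largest resulting probabilities. The sketch below assumes a one-tailed test against positive correlation (small Σd²); the function names and the example data are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product
from math import factorial


def sum_d2(rx, ry):
    """Sum of squared rank differences."""
    return sum((a - b) ** 2 for a, b in zip(rx, ry))


def one_tailed_level(rx, ry):
    """N/n! where N counts re-pairings with sum d^2 <= the observed value."""
    observed = sum_d2(rx, ry)
    n_extreme = sum(1 for p in permutations(ry) if sum_d2(rx, p) <= observed)
    return Fraction(n_extreme, factorial(len(rx)))


def untied_rankings(values):
    """Every untied ranking obtained by distributing tied-for ranks within tied groups."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    groups = []
    for k in order:
        if groups and values[groups[-1][0]] == values[k]:
            groups[-1].append(k)
        else:
            groups.append([k])
    options, next_rank = [], 1
    for g in groups:
        tied_for = range(next_rank, next_rank + len(g))
        options.append([list(zip(g, p)) for p in permutations(tied_for)])
        next_rank += len(g)
    rankings = []
    for choice in product(*options):
        ranking = [0] * len(values)
        for assignment in choice:
            for index, rank in assignment:
                ranking[index] = rank
        rankings.append(tuple(ranking))
    return rankings


def tie_error_limits(x, y):
    """Smallest and largest one-tailed levels over all resolutions of the ties."""
    levels = [one_tailed_level(rx, ry)
              for rx in untied_rankings(x) for ry in untied_rankings(y)]
    return min(levels), max(levels)


# Hypothetical data: the two smallest y values are tied.
low, high = tie_error_limits([1, 2, 3, 4, 5], [1, 1, 3, 4, 5])
```

The larger of the two limits corresponds to the assignment least conducive to rejection, which is the conservative choice recommended in the text.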


e. Efficiency. Spearman's rank difference correlation
coefficient has an asymptotic estimate efficiency of 9/π² or .912 as
an estimator of Pearson's product moment correlation coefficient
when the latter is zero and when both coefficients are obtained from
large samples from a bivariate normal population (13). Under the
conditions outlined above, therefore, the rank difference test for
correlation has an asymptotic relative efficiency of .912 relative
to the parametric test for correlation (27, 26).


The test has been shown by Hoeffding (10, 11) to be
asymptotically biassed for certain alternatives.

f. Application. Using the same data used in the example
of application of Pitman's correlation test we have:

      x          1    2    5    8   14

      y          2    1    7   10   15

      x rank     1    2    3    4    5

      y rank     2    1    3    4    5

      d         -1   +1    0    0    0

      d²         1    1    0    0    0

The value of ρ for the obtained sample is

$$\rho = 1 - \frac{6 \times 2}{125 - 5} = .90,$$

which can be exceeded in only one way: by switching the y ranks 1 and
2 so as to obtain a perfect positive correlation. It can be equalled,
however, by any one of the following four ways, the x ranks being
listed only once since re-pairing can be accomplished by manipulating
only the ys:

      x     1  2  3  4  5

      y     2  1  3  4  5

      y     1  3  2  4  5

      y     1  2  4  3  5

      y     1  2  3  5  4

Therefore for a one-tailed test of the hypothesis that correlation is
either zero or negative, α = N/n! = (4 + 1)/5! = 5/120. For a
two-tailed test, N must include all sets of re-paired ranks which
yield a ρ of -.90 or a larger negative magnitude. They are listed as
follows:

      x     1  2  3  4  5

      y     5  4  3  2  1        ρ = 1 - 40/20 = -1

      x     1  2  3  4  5

      y     4  5  3  2  1

      y     5  3  4  2  1        ρ = 1 - 38/20 = -.90

      y     5  4  2  3  1

      y     5  4  3  1  2

It is clear therefore that the test statistic is symmetrically distributed
so that the significance level for a two-tailed test is just twice that for
a one-tailed test, i.e., α = N/n! = 10/120. In actual application, of
course, the significance levels would be obtained directly from tables
rather than by enumerating the number of ways which constitute the
numerator of the probability fraction.
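The enumeration in this example can be reproduced by brute force. The following is a minimal sketch, assuming untied data; the function names are illustrative, and exact fractions are used so that equality with the observed ρ is tested without rounding error.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def ranks(values):
    """Rank untied values from 1 (smallest) to n (largest)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def rho_from_ranks(rx, ry):
    """Spearman's coefficient as an exact fraction."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - Fraction(6 * d2, n ** 3 - n)


def exact_levels(x, y):
    """Observed rho with one-tailed and two-tailed levels N/n!."""
    rx, ry = ranks(x), ranks(y)
    observed = rho_from_ranks(rx, ry)
    all_rhos = [rho_from_ranks(rx, p) for p in permutations(ry)]
    n_fact = factorial(len(rx))
    one = sum(1 for r in all_rhos if r >= observed)
    two = sum(1 for r in all_rhos if abs(r) >= abs(observed))
    return observed, Fraction(one, n_fact), Fraction(two, n_fact)


x = [1, 2, 5, 8, 14]
y = [2, 1, 7, 10, 15]
observed_rho, one_tailed, two_tailed = exact_levels(x, y)
```

For these data the sketch returns ρ = 9/10 with levels 5/120 and 10/120, in agreement with the hand enumeration. Enumeration of all n! re-pairings is practical only for small n.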

g. Discussion. It has been forcefully pointed out (22 and
editorial note accompanying 30) in the past that correlation between
sets of ranked variate values is not the same thing as correlation
between sets of original variate measurements. Recent results by
Stuart, however, indicate that when samples are of moderate or large
size, conclusions as to correlation among original measurements may
reasonably be drawn from tests of correlation which use only the ranks.
Stuart (27, see also 15, pp. 124-125) found that when sample size
increased from 25 to infinity the correlation between original
measurements and their ranks increased from .94 to .98, for samples
from normally distributed populations, and from .96 to 1.00 for
samples from uniformly distributed populations with finite range.

There are, at present, two outstanding rank tests for
correlation, the present test and Kendall's rank order test of
correlation. The two tests are not mathematically equivalent: "it is
possible to have populations in which τ = 0 and ρ = 1/2 or -1/2" (4).
However, when applied to samples from bivariate normal populations in
which x and y are uncorrelated, Spearman's ρ and Kendall's τ are
highly correlated. For such cases, the product moment correlation
coefficient for correlation between ρ and τ is .980 when n = 5, .990
when n = 20, and 1.00 when n = ∞ (15 p. 80, 6, 5).

When applied to very large samples from a bivariate normal
population in which the population product moment correlation between
x and y is r, the product moment correlation between Spearman's ρ
and Kendall's τ is 1 when r = 0, .9998 when r = .2, .9981 when
r = .4 and .9843 when r = .8, "though it tends to zero as r approaches
unity" (15, p. 131).

The rank difference correlation test has the advantage that it
can be performed very quickly. Also, because rank differences are
squared, the test is particularly desirable when one wishes to weight
large discrepancies between ranked xs and ys more heavily than small
ones. In most other respects, however, the test appears to be
inferior to Kendall's rank order correlation test (15, 16, 18).

Both the distribution of ρ and that of τ approach the normal
distribution as n increases (13, 10, 5). However, the distribution
of ρ is inadequately approximated by the normal distribution when
samples are of a size just too large for the exact tables, which
extend from n = 2 to n = 10, to be applicable. The "fit" between the
distribution of ρ and its normal approximation is poor at the most
important region, the tails, when n is small, e.g. when n = 11.
Furthermore, at these small sample sizes the distribution of ρ is
very jagged ordinatewise, presenting a sawtoothed appearance (15, 16).
By contrast, the distribution of Kendall's τ approaches the normal
form much more rapidly so that the normal approximation is reasonably
good at those sample sizes at which it must be used to obtain
probabilities. At these sample sizes the distribution of τ is such that
the curve descends monotonically on either side of its mode, the
entire curve including its tails giving the appearance of a very nearly
normal distribution (15).

A modification of the rank difference correlation test has been
considered by Daniels (4) as a test for trend. It has an asymptotic
relative efficiency of (3/π)^(1/3) or .98, relative to the regression
coefficient test, b, as a test of randomness against normal regression
alternatives. When applied in these circumstances, it is equal in
efficiency to Mann's T test, and generally superior to other
distribution-free tests of randomness (29, 26). See Table I of the
Introduction.
h. Tables. Exact probabilities have been tabled for 2 ≤ n ≤ 7
by Olds (20), for 2 ≤ n ≤ 8 by Kendall, Kendall and Smith (16), for
n = 9 and n = 10 by David, Kendall and Stuart (6), and for 4 ≤ n ≤ 10
by Kendall (15). Approximate probabilities have been tabled for
8 ≤ n ≤ 30 by Olds (20, 21) using a Type II curve for 8 ≤ n ≤ 10 and
using the normal approximation for 11 ≤ n ≤ 30. All of these tables
are entered with Σd² rather than ρ. Thornton (31) has "translated"
Olds' tables of probabilities for Σd² into probabilities for ρ. Olds'
tables have been criticized as containing distortions when sample
sizes are in the region of n = 11 (31).


If there is no correlation between ranked xs and ys, then as
n increases the sampling distribution of ρ approaches a normal
distribution whose mean is zero and whose variance is 1/(n - 1).
Likewise, the sampling distribution of Σd² approaches a normal
distribution whose mean is

$$\frac{n^3 - n}{6}$$

and whose variance is

$$\left(\frac{n^3 - n}{6}\right)^2 \frac{1}{n - 1}.$$

Therefore, for large samples,

$$\frac{\rho}{\sqrt{\frac{1}{n - 1}}} \quad \text{or} \quad \frac{\sum d^2 - \frac{n^3 - n}{6}}{\frac{n^3 - n}{6} \sqrt{\frac{1}{n - 1}}}$$

may be treated as normal deviates with zero mean and unit variance,
and probabilities may be obtained by referring these critical ratios
to normal tables. Various corrections to these formulae are available
which "correct" for the effect of ties (12, 15, 30, 36) or for
discontinuity (see 15, pp. 34-35, 38-41, 59-60). However, because of
the biassing effect produced by ties, the most reasonable procedure
would appear to be the most conservative one. Following this
philosophy tied observations would be assigned the tied-for ranks
least conducive to rejection of the null hypothesis. Probabilities
would then be obtained using formulae "uncorrected" for ties. When
there are no ties, the interval between successive values of Σd² is 2,
so the appropriate correction for continuity consists of subtracting
or adding 1 to the numerator of the critical ratio. If the numerator
is positive, it should be decreased by 1, if negative, increased by 1.
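The large-sample critical ratios and the continuity correction just described may be sketched as follows. This is only an illustration under the no-ties assumption; the function names and the numerical example (n = 20, Σd² = 700) are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative probability of the unit normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def rho_critical_ratio(rho, n):
    """rho divided by its null standard deviation, sqrt(1 / (n - 1))."""
    return rho / sqrt(1 / (n - 1))


def sum_d2_critical_ratio(sum_d2, n, continuity=True):
    """Critical ratio for sum d^2, optionally corrected for continuity."""
    mean = (n ** 3 - n) / 6
    sd = mean * sqrt(1 / (n - 1))
    numerator = sum_d2 - mean
    if continuity:
        # Successive values of sum d^2 differ by 2, so move the numerator
        # 1 unit toward zero.
        if numerator > 0:
            numerator -= 1
        elif numerator < 0:
            numerator += 1
    return numerator / sd


n, observed_sum_d2 = 20, 700
rho = 1 - 6 * observed_sum_d2 / (n ** 3 - n)
z_rho = rho_critical_ratio(rho, n)
z_d2_plain = sum_d2_critical_ratio(observed_sum_d2, n, continuity=False)
z_d2_corrected = sum_d2_critical_ratio(observed_sum_d2, n)
lower_tail_probability = normal_cdf(z_d2_corrected)
```

Without the continuity correction the two critical ratios differ only in sign, since a small Σd² corresponds to a large positive ρ. The correction shrinks the ratio toward zero and therefore yields the more conservative probability.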

i. Sources. 2-18, 20-22, 25-31, 35, 36.


3.  Test  for  Serial  Correlation 


Wald and Wolfowitz (32) have considered the Method of
Randomization as a means of testing the significance of the serial
correlation coefficient,

$$\frac{\sum_{i=1}^{n} X_i X_{i+h} - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2}{\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2}.$$

There are n! permutations of the order in which the Xs were actually
recorded, and for all of them $\sum_{i=1}^{n} X_i$ (and
$\sum_{i=1}^{n} X_i^2$) will be the same. Therefore, the statistic
used is simply

$$R_h = \sum_{i=1}^{n} X_i X_{i+h},$$

the subscript i indicating the ith X in order of appearance and h
indicating the "lag" or indicating the period of a suspected cyclical
fluctuation. When i + h > n, $X_{i+h-n}$ is used instead of
$X_{i+h}$. The value of $R_h$ is calculated, in effect, for each of
the n! possible permutations of order (which are equally probable if
the null hypothesis, that the Xs are independent observations from the
same population, and therefore appear in random order, is true). The
N "values which constitute the critical region will depend in each
particular problem on the possible alternatives to randomness," and so
will the value of h. The significance level is N/n!, and the null
hypothesis is rejected if the actually obtained value of $R_h$ is
among the N values of $R_h$ which constitute the critical region. It
is assumed that the Xs come from a continuously distributed
population.

The value $R_h$ is asymptotically normally distributed (under
mild qualifications) and, if h is prime to n, the distribution of
$R_h = \sum_{i=1}^{n} X_i X_{i+h}$ is the same as the distribution of
$R_1 = \sum_{i=1}^{n} X_i X_{i+1}$. Therefore, by taking h and n so
that h is prime to n, the significance of $R_h$ can be tested, for
large samples, by referring the critical ratio

$$\frac{R_1 - \bar{R}_1}{\sigma_{R_1}}$$

to normal tables. Unfortunately, considerable calculation is required
to obtain the mean and variance of $R_1$.
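For small n the randomization distribution of the circular statistic can be enumerated directly. The sketch below is illustrative only: the function names are hypothetical, the critical region is taken (as an assumption) to be the upper tail of the statistic, and integer data are used so that sums compare exactly.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def r_h(xs, h):
    """Circular lag-h statistic: sum of X_i * X_(i+h), wrapping past n."""
    n = len(xs)
    return sum(xs[i] * xs[(i + h) % n] for i in range(n))


def null_distribution(xs, h):
    """Sorted values of the statistic over all n! orderings of the Xs."""
    return sorted(r_h(p, h) for p in permutations(xs))


def upper_tail_level(xs, h):
    """N/n!, where N counts orderings giving a value >= the observed one."""
    observed = r_h(xs, h)
    n_extreme = sum(1 for p in permutations(xs) if r_h(p, h) >= observed)
    return Fraction(n_extreme, factorial(len(xs)))


# Hypothetical series of n = 5 untied observations.
series = [2, 3, 5, 7, 11]
level_lag_1 = upper_tail_level(series, 1)
same_distribution = null_distribution(series, 1) == null_distribution(series, 2)
```

With n = 5 every lag from 1 to 4 is prime to n, so the enumerated distributions for lag 1 and lag 2 coincide, which illustrates the property stated in the text.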

The authors have suggested that the test might be improved
if the Xs were replaced by their ranks. Noether (19) finds that both
tests, i.e., the one already outlined and the one in which Xs are
replaced by their ranks, are consistent against certain alternatives of
cyclical trend where h is the length of cycle. He finds that either test
may have the greater asymptotic relative efficiency with respect to the
other, depending upon the distribution of the population of Xs. The
asymptotic relative efficiency of the $R_h$ test relative to Mann's T
test was found by Noether to be zero under certain stated conditions.
It also has A.R.E. of zero relative to the best parametric test based
on the regression coefficient (29, 26).
CHAPTER VII


TESTS  BASED  UPON  INVERSIONS 


Correlation can be tested by arranging units in increasing order
of one variable and testing the resulting order of the other variable
for randomness. If there are n units and there is no correlation,
the resulting sequence of observations on the second variable is
equally likely to be any of the n! possible permutations of the
observations. However, if the two variables are linearly correlated, the
observations on the second variable should tend to form an increasing
or decreasing sequence, and the number of inversions in this sequence
should tend to be extreme. By using the number of inversions as test
statistic and applying essentially Fisher's Method of Randomization,
an exact test for correlation can be formed and its probabilities
tabled. By taking "time" as the first variable, the test can be made
a test for trend.
1. The Distribution of Inversions


Let the integers from 1 to n be arranged in some order, such
as the following: 3 5 1 4 2 6. When a given number is followed
by a smaller number an inversion exists. In the sequence of integers
just presented, there are six inversions: 3 is followed by two smaller
numbers 1 and 2; 5 is followed by three smaller numbers 1, 4 and 2;
and 4 is followed by the smaller number 2.

If the order in which the n integers are to be arranged is
determined by a random process, then each of the n! permutations
of the n integers is equally probable. And the a priori probability
of obtaining a random sequence with exactly I inversions is simply
the number of permutations containing exactly I inversions divided by
n!, the number of permutations possible.
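The counting rule and the resulting probabilities can be sketched in a few lines. The function names below are illustrative; "followed by" is taken, as in the example above, to mean followed anywhere later in the sequence.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import factorial


def inversions(seq):
    """Number of pairs in which an element is followed (anywhere later) by a smaller one."""
    n = len(seq)
    return sum(1 for i in range(n) for j in range(i + 1, n) if seq[i] > seq[j])


def inversion_distribution(n):
    """A priori probability of each inversion count I for a random ordering of 1..n."""
    counts = Counter(inversions(p) for p in permutations(range(1, n + 1)))
    return {i: Fraction(c, factorial(n)) for i, c in sorted(counts.items())}


example_count = inversions([3, 5, 1, 4, 2, 6])
distribution_n4 = inversion_distribution(4)
```

For the sequence 3 5 1 4 2 6 the count is six, as stated in the text. For n = 4 the 24 permutations are divided among I = 0, 1, ..., 6 with frequencies 1, 3, 5, 6, 5, 3, 1.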


Besides I, two additional measures directly related to
inversions will be encountered. For a single permutation the maximum
number of inversions is simply the number of pairs of integers which
are compared, $\binom{n}{2} = \frac{n}{2}(n-1)$. Therefore the number
of times an integer is followed by a larger integer in the sequence is
the complement of I and is equal to $\frac{n}{2}(n-1) - I$. This
measure will be designated as T. The other measure is S which is equal
to T - I. The following table gives the distributions of I, T and S
for n = 4.


The distribution of I has mean $\frac{n}{4}(n-1)$ and variance
$\frac{n(n-1)(2n+5)}{72}$, and as n approaches infinity it approaches
the normal distribution (8, 40, 54). Therefore, for large n the
critical ratio

$$\frac{I - \frac{n}{4}(n-1)}{\sqrt{\frac{n(n-1)(2n+5)}{72}}}$$

may be treated as a normal deviate.
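The stated mean and variance can be verified by enumeration for n = 4, and the critical ratio computed for any observed I. This is a minimal sketch with hypothetical function names; exact fractions are used for the moments.

```python
from fractions import Fraction
from itertools import permutations
from math import sqrt


def inversions(seq):
    """Number of pairs in which an element is followed by a smaller one."""
    n = len(seq)
    return sum(1 for i in range(n) for j in range(i + 1, n) if seq[i] > seq[j])


def i_t_s(seq):
    """Return (I, T, S), where T = n(n-1)/2 - I and S = T - I."""
    n = len(seq)
    i = inversions(seq)
    t = n * (n - 1) // 2 - i
    return i, t, t - i


def critical_ratio(i, n):
    """(I - mean) / standard deviation under the null hypothesis."""
    mean = n * (n - 1) / 4
    variance = n * (n - 1) * (2 * n + 5) / 72
    return (i - mean) / sqrt(variance)


# Exact moments of I over the 24 permutations of 1 2 3 4.
values = [inversions(p) for p in permutations(range(1, 5))]
mean_4 = Fraction(sum(values), len(values))
var_4 = Fraction(sum(v * v for v in values), len(values)) - mean_4 ** 2
```

For n = 4 the enumeration gives mean 3 and variance 13/6, which agree with n(n-1)/4 and n(n-1)(2n+5)/72.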
TABLE V


DISTRIBUTION OF INVERSIONS FOR n = 4


      Permutation    I    T    S          Permutation    I    T    S

       1 2 3 4       0    6    6           3 1 2 4       2    4    2
       1 2 4 3       1    5    4           3 1 4 2       3    3    0
       1 3 2 4       1    5    4           3 2 1 4       3    3    0
       1 3 4 2       2    4    2           3 2 4 1       4    2   -2
       1 4 2 3       2    4    2           3 4 1 2       4    2   -2
       1 4 3 2       3    3    0           3 4 2 1       5    1   -4
       2 1 3 4       1    5    4           4 1 2 3       3    3    0
       2 1 4 3       2    4    2           4 1 3 2       4    2   -2
       2 3 1 4       2    4    2           4 2 1 3       4    2   -2
       2 3 4 1       3    3    0           4 2 3 1       5    1   -4
       2 4 1 3       3    3    0           4 3 1 2       5    1   -4
       2 4 3 1       4    2   -2           4 3 2 1       6    0   -6
2. Kendall's Rank Order Correlation Test


a. Rationale. Suppose that an x measurement and a y
measurement have been taken on each of n units and that tied xs and
tied ys are both impossible. If the units are arranged from left
to right in order of increasing x scores, the sequence of ys will
be random if x and y are uncorrelated. However, if x and y are
linearly correlated, the sequence of ys will tend to increase or
decrease systematically, and the number of inversions among the ys
will tend to be small or large respectively. Therefore the number
of inversions among the ys can be used to test the null hypothesis
that x and y are randomly associated against the alternative that
they are linearly correlated.


Let the xs be ranked from 1 to n and the ys also, and let the
units be arranged in increasing order of x rank. Then if T is the
number of times a y rank is followed by a larger y rank and I is the
number of times a y rank is followed by a smaller y rank, Kendall's
test statistic is S = T - I. Since $T = \frac{n}{2}(n-1) - I$,
$S = \frac{n}{2}(n-1) - 2I$ or $S = 2T - \frac{n}{2}(n-1)$, so S, I
and T are mathematically equivalent test statistics (when there are
no tied scores).


The xs need not actually be arranged in order of increasing
magnitude in order to calculate S. It is obvious from the foregoing
that S is simply the number of the $\binom{n}{2}$ pairs of units in
which the x and y scores of one member deviate in the same direction
from their respective x and y counterparts in the other member minus
the number of pairs in which they deviate in opposite directions.
Therefore, let the units be arranged in any arbitrary order and let
subscripts indicate position in this order, unit j being any unit to
the right of unit i. Let $a_{ij}$ be a dummy score which is +1 if
$x_j$ is greater than $x_i$ and -1 if $x_j$ is less than $x_i$.
Similarly $b_{ij}$ is +1 if $y_j > y_i$ and -1 if $y_j < y_i$.
Finally, let $c_{ij} = a_{ij} b_{ij}$ so $c_{ij}$ is +1 if
$(x_j - x_i)(y_j - y_i)$ is positive, i.e., if either $x_i < x_j$ and
$y_i < y_j$ or $x_i > x_j$ and $y_i > y_j$, and is -1 if the product
is negative. Then S is the sum of the $c_{ij}$s taken over all values
of j > i and all values of i from 1 to n.
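Both routes to S described above can be sketched and compared. The function names are illustrative, ties are assumed absent, and the data are those of the earlier rank difference example.

```python
def sign(v):
    """+1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_s(x, y):
    """S as the sum of c_ij = a_ij * b_ij over all pairs with j > i."""
    n = len(x)
    return sum(sign(x[j] - x[i]) * sign(y[j] - y[i])
               for i in range(n) for j in range(i + 1, n))


def s_from_inversions(x, y):
    """S = n(n-1)/2 - 2I, with I counted among the ys in increasing order of x."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    ys = [y[k] for k in order]
    n = len(ys)
    i_count = sum(1 for a in range(n) for b in range(a + 1, n) if ys[a] > ys[b])
    return n * (n - 1) // 2 - 2 * i_count


x = [1, 2, 5, 8, 14]
y = [2, 1, 7, 10, 15]
s_pairs = kendall_s(x, y)
s_inversions = s_from_inversions(x, y)

# The same units listed in an arbitrary order give the same S.
x_shuffled = [8, 1, 14, 5, 2]
y_shuffled = [10, 2, 15, 7, 1]
s_shuffled = kendall_s(x_shuffled, y_shuffled)
```

For these data there is a single inversion among the ys, so S = 10 - 2 = 8 by either route, whatever the order in which the units are listed.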
b. Null Hypothesis. Each of the n! possible permutations
of rank order of the ys was equally likely, before sampling, to be
found when the units are arranged in order of increasing rank on
the x measurement. A sufficient condition for the validity of the
null hypothesis is that x and y are uncorrelated and all
assumptions are true.

c. Assumptions. The units have been drawn independently
and at random from a population in which each variable, x and y,
is either continuously distributed or exists naturally in the form
of untied ranks.

d. Treatment of Ties. If ties are due to imprecision of
measurement, the safest rule is probably to distribute the tied-for
ranks to the tied measurements in the way least conducive to
rejection of the null hypothesis. The limits of tie-error can be
obtained by comparing the probability obtained in this manner with
that obtained by taking the opposite course. This rule may be safely
followed regardless of whether exact or normal tables are used and
without recourse to extensive corrections in formulae or to
modifications of procedure. An alternative method is to give
observations in each group of tied values the average of the ranks the
members of the group would have if distinguishable. The midrank
method, however, requires considerable qualification as will be shown
in the paragraphs to follow.

When ties are assigned the midrank and probabilities are
obtained from tables constructed upon the assumption that ties are
impossible, the obtained probabilities are distorted; however, the
statistic S is a far safer one than T or I. Consider first the case
where ties are due to imprecision of measurement. If the ys are
arranged in order of increasing x-rank and the last three of four
y-ranks are tied, the y-ranks are 1 3 3 3, so T = 3, I = 0 and
S = 3. The true ranking of the ys could be any of the following
permutations:
      Permutation    T    I    S

       1 2 3 4       6    0    6
       1 2 4 3       5    1    4
       1 3 2 4       5    1    4
       1 3 4 2       4    2    2
       1 4 2 3       4    2    2
       1 4 3 2       3    3    0


The average T is 4 1/2 but the value of T obtained by the midrank
method lies at one end of the range of possible true values. The
situation for I is analogous. However, the average S is precisely the
value obtained by using the midrank. (The average S, however, is
an odd number, whereas for n = 4 when there are no ties S assumes
only even values. The probability tables therefore will have no
entry for S = 3 and another source of inexactitude will have arisen.)
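The figures in this example can be reproduced by resolving the three-way tie in all six possible ways. The sketch below uses illustrative names and counts a tied pair as contributing to neither T nor I.

```python
from fractions import Fraction
from itertools import permutations


def t_i_s(ys):
    """Return (T, I, S) for a sequence of y-ranks; tied pairs count for neither T nor I."""
    n = len(ys)
    t = sum(1 for a in range(n) for b in range(a + 1, n) if ys[a] < ys[b])
    i = sum(1 for a in range(n) for b in range(a + 1, n) if ys[a] > ys[b])
    return t, i, t - i


# Midrank treatment: the last three y-ranks are tied.
midrank_stats = t_i_s([1, 3, 3, 3])

# Every true ranking consistent with the tie.
resolutions = [(1,) + p for p in permutations((2, 3, 4))]
stats = [t_i_s(r) for r in resolutions]

average_t = Fraction(sum(s[0] for s in stats), len(stats))
average_i = Fraction(sum(s[1] for s in stats), len(stats))
average_s = Fraction(sum(s[2] for s in stats), len(stats))
```

The midrank T of 3 is the smallest of the six possible true values and the midrank I of 0 is likewise at the end of its range, whereas the midrank S of 3 equals the average of the six possible values of S.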

Consider now the case where ties represent intrinsic
equality rather than imprecision of measurement. In this event,
the proper tables are those based upon the frequency distribution
of S given that certain ties exist, such as the tables prepared by
Sillitto (43). The appropriate tables, therefore, would be derived
by obtaining the frequency distributions of S when each ranking
contains specified numbers of ties of specified extents, obtaining each
such distribution by letting the y rankings assume every
distinguishable permutation while the x ranking is held constant. The
conventional tables, derived from untied rankings, are not appropriate
and, if used in lieu of, or in the absence of, the proper tables, may
lead to gross errors in probabilities. The amount of error attendant
upon this procedure, however, is not the same for T, I and S.
These three statistics are mathematically equivalent when ties are
impossible, but not otherwise. When ties exist the maximum
value T and I can assume is reduced, but the minimum value is
the same as if ties were impossible. Since S is the difference
between T and I, and since T is inversely related to I, S can assume
neither the same maximum nor the same minimum as it could if
ties were absent. The result is that the distributions of S when
there are ties tend to maintain symmetry about the same point as
that about which the distribution of S is symmetrical when ties are
impossible. And since it is the extreme "tabled" values which
become impossible when there are ties, the true probability of the
central Ss tends to gain at the expense of the extremes. Therefore
the error of referring S to tables based on the assumption of no ties
is likely to be a decrease in the probability of rejection, and the error
will tend to be a "conservative" one. Furthermore the error tends
to be no greater for a one-tailed than for a two-tailed test.

The distributions of T and I when there are ties tend to occupy a
region closer to their minimal values than is the case when there are
no ties. This distribution may be quite skewed, and even if it is
symmetrical, the point of symmetry is closer to the minimal value than
is the case when there are no ties. The result is that the true
probability of the smallest values of T or I tends to be much greater
than that obtained from tables based on no ties, thus spuriously
increasing the probability of rejection when the rejection region
consists of the smallest values. The situation is improved by using a
two-tailed test, but the error may still be great in the direction of
spurious rejection. The obvious conclusion is that, while there is no
choice between T, I and S when ties are impossible, S is a much safer
test statistic, although by no means free from error, when ties are
present either because of imprecision of measurement or because of
intrinsic equality.
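A conditional distribution of the kind described, in which the x ranking is held constant while the tied y ranking assumes every distinguishable permutation, might be enumerated as follows for small n. The function names are hypothetical, and the example assumes n = 4 with one tie of three ranks in the y ranking.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def sign(v):
    """+1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_s(x, y):
    """S summed over all pairs; a tied pair contributes zero."""
    n = len(x)
    return sum(sign(x[j] - x[i]) * sign(y[j] - y[i])
               for i in range(n) for j in range(i + 1, n))


def conditional_s_distribution(x_ranks, y_ranks):
    """Distribution of S over the distinguishable permutations of y, x held fixed."""
    arrangements = set(permutations(y_ranks))
    counts = Counter(kendall_s(x_ranks, a) for a in arrangements)
    total = len(arrangements)
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


untied = conditional_s_distribution((1, 2, 3, 4), (1, 2, 3, 4))
tied = conditional_s_distribution((1, 2, 3, 4), (1, 3, 3, 3))
```

With the three-way tie only four arrangements are distinguishable, giving S = -3, -1, 1 and 3 with equal probability. The distribution remains symmetrical about zero, but the extreme tabled values of ±6, ±4 become impossible.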


If ties result from intrinsic equality between scores, the
tie is not an artifact of measurement, but represents a fundamental
discrepancy between the mathematical model and the situation it is
intended to simulate. For such cases it is reasonable to alter the
mathematical model. Sillitto (43) has followed essentially this
procedure by obtaining the exact distribution of S when there are
groups of two tied scores and groups of three tied scores in one
of the two rankings, the other ranking being tieless. He has tabled
the probability of S for all possible values of p₂ and p₃ (and for all
combinations thereof), from zero to the maximum number, for
3 ≤ n ≤ 10. These probabilities are conditional probabilities: they
state the probability of S given that one ranking is tieless and the other
contains p₂ groups of two tied ranks and p₃ groups of three tied ranks.
When there are ties and they are assigned the midrank, the mean of S
remains zero, but its variance is altered. The formula for the
variance of S when there are ties in one or both rankings has been
obtained (23, 27, 43, 56; see also 52 for variance of T). Therefore
when samples are large and ties are given the midrank, the significance
of S can be obtained by referring the critical ratio, based upon the
"corrected" variance, to normal tables. Again, however, the
probability obtained is conditional upon the existence, in the
population, (either the population of original scores or the
corresponding population of measurements) of ties in precisely the
number and extent implied by the corrected variance formula, e.g., in
the same proportionate number and extent as exists in the obtained
samples. Two further disadvantages are that the corrected variance
formula is a long one and, when a critical ratio based upon
it is referred to normal tables, the correction for continuity
depends upon which of several tie situations exists and may not be
precisely determinable.


TABLE VI

Conditional Frequency Distributions of T, I and S When n = 4 and There
Are No Ties, One Tied Pair in One Ranking, Two Tied Pairs in
One Ranking, and One Tie of Three Ranks in One Ranking


TABLE VII

Cumulative Probability Distributions for T, I and S When n = 4 and
There Are No Ties, One Tied Pair in One Ranking, Two Tied Pairs
in One Ranking, and One Tie of Three Ranks in One Ranking


TABLE VIII

Conditional Frequency Distributions of T, I and S When n = 4 and There
Are No Ties in Either Ranking and When n = 4 and There Are Three Tied
Ranks in One Ranking and Either Two Tied Ranks, Two Sets of Two Tied
Ranks, or Three Tied Ranks in the Other Ranking


TABLE IX

Cumulative Probability Distributions for T, I and S When n = 4 and
There Are No Ties in Either Ranking and When n = 4 and There Are
Three Tied Ranks in One Ranking and Either Two Tied Ranks, Two
Sets of Two Tied Ranks, or Three Tied Ranks in the Other Ranking


When  one  takes  a  ranking  containing  ties,  resolves  the 
ties  in  all  possible  ways  and  calculates  S  for  each  way,  the  average 
of  these  Ss  is  the  same  as  the  S  obtained  by  the  midrank  method. 
However,  if,  following  Muhsam  (36),  one  takes  an  untied  ranking 
or  pair  of  rankings,  introduces  all  possible  ties,  calculates  S 
for  each  case,  and  obtains  the  average  S  it  is  extremely  unlikely 
to be the same as the S for the untied rankings, and the distribution
of Ss is likely to be quite skewed. In the former case one is
dealing  with  a  single  conditional  distribution  since  the  number, 
extent  and  location  of  the  tied  groups  is  specified  and  fixed.  In 
this  case,  if  the  null  hypothesis  is  true,  each  of  the  untied  rankings 
which  might  have  been  the  true  ranking  is  equally  likely  to  have  been 
obtained  as  an  untied  ranking  and  therefore  should  be  equally  likely 
to be the true "parent" of the tied ranking. In the latter case, however,
the situation is quite different. One is dealing with a multiplicity
of conditional distributions, e.g., in a ranking of four objects,
the  distribution  of  S  conditional  upon  the  existence  of  one  tie  of  two 
rankings,  the  remaining  two  rankings  being  untied,  or  the  distribu¬ 
tion  of  S  conditional  upon  the  existence  of  one  tie  of  four  rankings, 
etc.  To  calculate  S  for  all  distinguishable  rankings  under  all  pos¬ 
sible  tying  situations  and  then  take  the  average  S  by  summing  the 
individual  Ss  and  dividing  by  their  total  number  is  implicitly  to 
assume  that  each  such  component  S  is  equally  probable.  This  in 
turn  introduces  the  assumption  that  each  tying  condition's  relative 
probability  is  in  proportion  to  the  number  of  distinguishable  rankings 
which can be obtained from it. It can easily be shown that this
assumption is false. In the example just given, there are $4!/2!$
distinguishable permutations of four ranks of which a specified two are
tied, and these two can be any one of three pairs. So the total number
of distinguishable permutations of four ranks, two of which are
tied, is $3(4!/2!)$ or 36. However, all four ranks can be tied in only
one way, so the ratio of the number of distinguishable rankings
when there is one tie of two ranks to the number when there is one
tie of four ranks is 36, i.e., the ratio is a constant for the case
under consideration. Now let p be the unknown probability that a
rank is tied with the "truly" next higher rank and let q = 1 - p be
the probability that it is not. A single tie of two ranks can be
obtained in the following ways and with the following probabilities:
rank 1 is tied with rank 2 but rank 2 is not tied with rank 3 and rank
3 is not tied with rank 4 ($\Pr = pq^2$); rank 1 is not tied with rank 2,
rank 2 is tied with rank 3, but rank 3 is not tied with rank 4
($\Pr = pq^2$); rank 1 is not tied with rank 2, rank 2 is not tied with
rank 3, but rank 3 is tied with rank 4 ($\Pr = pq^2$). The probability
of a single tie of two ranks is therefore $3pq^2$. All four ranks can
be tied in one way: if rank 2 is tied with rank 1 and rank 3 with rank
2 and rank 4 with rank 3 ($\Pr = p^3$). The ratio of the probability
for a tie of two to that of a tie of four is $3pq^2/p^3$, which is a
variable which depends upon the unknown probability, p, that a rank
will be tied with the "truly" next higher rank. Thus the distribution
of S when all possible ties have been introduced in the rankings lacks
meaning because the various rankings, so obtained, are not all equally
probable when the null hypothesis is true.
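
The contrast drawn above, between a fixed ratio of counts and a
ratio of probabilities that moves with p, can be illustrated
numerically. The following is a minimal Python sketch under the
assumptions of the paragraph above; the function names are
illustrative and do not come from the report.

```python
from math import factorial


def count_ratio():
    """Ratio of the numbers of distinguishable rankings of four objects.

    Three possible tied pairs, each giving 4!/2! distinguishable
    permutations, against the single arrangement with all four tied.
    """
    one_pair_tied = 3 * (factorial(4) // factorial(2))
    all_four_tied = 1
    return one_pair_tied // all_four_tied


def probability_ratio(p):
    """Ratio Pr(one tie of two) / Pr(one tie of four) = 3pq^2 / p^3."""
    q = 1.0 - p
    return (3 * p * q ** 2) / p ** 3


# The count ratio is constant; the probability ratio depends on p.
for p in (0.1, 0.25, 0.5):
    print(p, count_ratio(), probability_ratio(p))
```

Only when p happens to make $3pq^2/p^3$ equal to 36 would simple
averaging over all distinguishable rankings weight the two tying
conditions correctly.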

e. Efficiency. When applied to samples of infinite size
from bivariate normal populations in which x and y are uncorrelated,
Kendall's tau is perfectly correlated with Spearman's rho
(5, 9, 23). Therefore the asymptotic estimate efficiency of $9/\pi^2$
or .912 for rho as an estimator of Pearson's product moment
correlation coefficient, when both coefficients are obtained from large
samples from a bivariate normal population, applies equally to
tau (47). Under the conditions stated, therefore, Kendall's rank
order test for correlation has an asymptotic relative efficiency of
.912 relative to the parametric test for correlation (17, 32).

The  test  has  been  shown  to  be  consistent  under  conditions 
stated  by  Mann  (29)  and  Terpstra  (52).  Conditions  for  its  un¬ 
biassedness  have  been  given  by  Mann  (29). 
f. Application. Designating units by letters, let the following
data represent the variate values of x and y on each unit:

    UNIT               A      B      C      D      E
    Measures    x    177     41     39    150     99
                y     84      4      7     53     16

Replacing variate values by their ranks, the data become:

    UNIT               A      B      C      D      E
    Measures  x-rank   5      2      1      4      3
              y-rank   5      1      2      4      3

One method of calculating S does not require putting one ranking
in the "natural" order, 1, 2, 3, etc. If there are ties in both
rankings and if ties are given the midrank this method should be used.
For each of the $\binom{n}{2}$ possible pairs of units, a +1 is scored
if the x and y of one unit deviate in the same direction from their
respective counterparts of the other unit, i.e., if they are both higher
or both lower than their counterparts in the other unit; otherwise, if
the deviations are in opposite directions, a -1 is scored. The sum
of the $\binom{n}{2}$ plus or minus 1s is S. (If ties are given the
midrank, the pairs for which the x ranks, the y ranks, or both, are tied
are given a zero score.) For example, for the comparison involving units
C and E the x rank of unit E is greater than the x rank of unit C and
the y rank of unit E is also greater than the y rank of unit C.
Therefore a score of +1 is recorded for this comparison. When unit C is
compared with unit B the x rank of B is the greater of the two x ranks
while its y rank is the lesser of the two, so a -1 is recorded. Of the
ten possible comparisons of pairs of units, nine result in a score of
+1 and one in a score of -1. So the algebraic sum S = +8.
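
The pairwise scoring just described can be sketched in a few lines of
Python. This is an illustration only; the function names are
hypothetical, and the data are the five units of the example above.

```python
from itertools import combinations


def sign(d):
    """Return +1, -1 or 0 according to the sign of d."""
    return (d > 0) - (d < 0)


def kendall_s(x, y):
    """Sum the +1 / -1 / 0 scores over all pairs of units.

    A pair tied on x, on y, or on both contributes zero.
    """
    return sum(
        sign(x[i] - x[j]) * sign(y[i] - y[j])
        for i, j in combinations(range(len(x)), 2)
    )


# Variate values for units A, B, C, D, E from the example above.
x = [177, 41, 39, 150, 99]
y = [84, 4, 7, 53, 16]
print(kendall_s(x, y))  # 8
```

Because only the signs of the differences are used, the same value is
obtained whether the variate values or their ranks are supplied.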
If none of the x ranks are tied, the units may be arranged
in order of increasing x rank, thus simplifying the calculation of S,
for now for any pair of units a score of +1 results if the y rank of
the unit "higher" in the series is greater than the y rank of the lower
unit, and a score of -1 results in the opposite case. Rearranging
the units so that the x ranks form an increasing sequence, the data
appear as follows:

    UNIT               C      B      E      D      A
    Measures  x-rank   1      2      3      4      5
              y-rank   2      1      3      4      5

And the calculation of S is shown in the following table:

                Number of larger      Number of smaller
    y rank      y ranks following     y ranks following
       2                3                     1
       1                3                     0
       3                2                     0
       4                1                     0
      Sum               9                     1

                        S = 9 - 1 = 8

The five y ranks can be permuted in 5! or 120 ways of which the
following permutations result in an S as great or greater than +8:

    1  2  3  4  5      S = +10
    2  1  3  4  5      S = +8
    1  3  2  4  5      S = +8
    1  2  4  3  5      S = +8
    1  2  3  5  4      S = +8

Therefore for a one-tailed test of the null hypothesis that the x and y
ranks have either zero or negative correlation, the significance level
is $\alpha = 5/120$. For a two-tailed test of the hypothesis of zero
rank correlation, the permutations yielding an S of -8 or less must also
be considered. They are:

    5  4  3  2  1      S = -10
    5  4  3  1  2      S = -8
    5  4  2  3  1      S = -8
    5  3  4  2  1      S = -8
    4  5  3  2  1      S = -8

So $\alpha = 10/120$ for the two-tailed test.
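
The enumeration above can be reproduced by brute force for small n.
The following Python sketch assumes no ties and x ranks already in
natural order; the function names are illustrative only.

```python
from itertools import combinations, permutations


def s_natural_order(y_ranks):
    """S for y ranks listed in order of increasing x rank (no ties)."""
    return sum(
        (y_ranks[j] > y_ranks[i]) - (y_ranks[j] < y_ranks[i])
        for i, j in combinations(range(len(y_ranks)), 2)
    )


def exact_tail_counts(n, s_observed):
    """Count permutations at least as extreme as s_observed.

    Returns (one-tailed count, two-tailed count, total permutations).
    """
    all_s = [s_natural_order(p) for p in permutations(range(1, n + 1))]
    one_tail = sum(1 for s in all_s if s >= s_observed)
    two_tail = sum(1 for s in all_s if abs(s) >= abs(s_observed))
    return one_tail, two_tail, len(all_s)


print(exact_tail_counts(5, 8))  # (5, 10, 120)
```

Full enumeration grows as n!, so this approach is practical only for
small samples.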

In  actual  application,  the  significance  levels  would  be  obtained 
from  tables.  Furthermore  it  is  clear  that  the  value  of  S  can  be  ob¬ 
tained  directly  from  the  variate  values  of  x  and  y  without  first  convert¬ 
ing  these  values  into  ranks.  Conversion  into  ranks  has  two  advantages 
however.  It  makes  the  counting  process  simpler  and  therefore  reduces 
the  likelihood  of  computational  error.  And  it  serves  as  a  reminder 
that  it  is  rank  correlation  which  is  being  tested,  not  correlation  among 
variate values. (The test could be used for the latter purpose, but
only as a conditional test, i.e., its conclusions would be restricted
to the set of observations obtained in the sample and could not be
extended to the sampled population.)
g. Discussion. Kendall's rank order correlation test is
one of the most important distribution-free tests. It is equalled in
efficiency and excelled in speed of computation by Hotelling and
Pabst's rank difference correlation test based on Spearman's rho.
However, in most other respects it is the better test (23, 28, 34).
(For a comparison of the two tests see Chapter VI.) As in the
case of the test based on rho, a coefficient of correlation can be
calculated from the data used in Kendall's test. The maximum
value S can attain is simply the number of comparisons of pairs,
$\binom{n}{2}$ or $\frac{n}{2}(n-1)$. Kendall therefore defines
$S / \left[\frac{n}{2}(n-1)\right]$ to be his
coefficient of rank correlation, which he calls tau. Its value
ranges from -1 for perfect negative correlation to +1 for perfect
positive correlation. Tau is related to rho, not directly, but by
certain mathematical inequalities, e.g., $-1 \le 3\tau - 2\rho \le 1$.
"When the sample is permuted in all possible ways", Daniels (5)
finds the correlation between $\tau$ and $\rho$ to be

$$\frac{2(n+1)}{\sqrt{2n(2n+5)}}.$$
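
As a numerical illustration of these relations, the sketch below
computes tau and rho for the five-unit example and checks the
inequality quoted above. It is a sketch only: the function names are
hypothetical, and the rank-difference formula used for rho is the
standard one for untied ranks, which this section does not restate.

```python
from itertools import combinations
from math import sqrt


def sign(d):
    return (d > 0) - (d < 0)


def kendall_tau(rx, ry):
    """S divided by its maximum, n(n-1)/2."""
    n = len(rx)
    s = sum(
        sign(rx[i] - rx[j]) * sign(ry[i] - ry[j])
        for i, j in combinations(range(n), 2)
    )
    return s / (n * (n - 1) / 2)


def spearman_rho(rx, ry):
    """Standard rank-difference formula, assuming untied ranks."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def daniels_correlation(n):
    """Correlation between tau and rho over all permutations."""
    return 2 * (n + 1) / sqrt(2 * n * (2 * n + 5))


x_rank = [5, 2, 1, 4, 3]
y_rank = [5, 1, 2, 4, 3]
tau = kendall_tau(x_rank, y_rank)
rho = spearman_rho(x_rank, y_rank)
print(tau, rho, 3 * tau - 2 * rho, daniels_correlation(5))
```

For these data tau is 8/10 and rho is 9/10, so that the quantity
$3\tau - 2\rho$ lies well inside the stated bounds.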


Kendall's statistic has many interesting properties. Moran
(33) has shown that S is directly related to the "least number of
interchanges of neighbors required to restore the permutation to the
normal order", i.e., the order 1, 2, 3, ..., n-1, n. If i is the
least number of such interchanges required, then

$$i = \frac{n(n-1) - 2S}{4}.$$
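
Moran's relation can be checked directly for small n. The sketch
below counts neighbor interchanges with a simple bubble sort, which
uses exactly as many adjacent swaps as there are inversions; this
choice of counting method and the function names are assumptions made
for illustration.

```python
from itertools import combinations, permutations


def s_natural_order(perm):
    """Kendall's S for a permutation compared with the natural order."""
    return sum(
        (perm[j] > perm[i]) - (perm[j] < perm[i])
        for i, j in combinations(range(len(perm)), 2)
    )


def neighbor_interchanges(perm):
    """Least number of adjacent swaps needed to sort perm."""
    seq = list(perm)
    swaps = 0
    for end in range(len(seq) - 1, 0, -1):
        for k in range(end):
            if seq[k] > seq[k + 1]:
                seq[k], seq[k + 1] = seq[k + 1], seq[k]
                swaps += 1
    return swaps


def relation_holds(perm):
    """Check 4i = n(n-1) - 2S for one permutation."""
    n = len(perm)
    return 4 * neighbor_interchanges(perm) == n * (n - 1) - 2 * s_natural_order(perm)


print(neighbor_interchanges((2, 1, 3, 4, 5)))  # 1
print(all(relation_holds(p) for p in permutations(range(1, 6))))  # True
```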


A number of statistical tests may be regarded as the form which would
be assumed by Kendall's test if it were modified to take account of
ties representing intrinsic equality. For example, the Mann-Whitney
test statistic U is the number of times an A-sample observation
precedes a B-sample observation when the observations from both
samples are arranged in order of increasing magnitude irrespective of
sample. U therefore may be regarded as T with increasing rank
order of magnitude for the combined sample as the x ranking and
with the y ranking consisting of a set of "tied" A's intermixed with
a set of "tied" B's. In this case, the "ties" would be due to
intrinsic equality, i.e., the "tied ranks" would define a category
(sample A or sample B) and "rank" would have no quantitative
meaning for the ys.
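
The identity between U and T can be made concrete with a small
example. In the sketch below the sample values are invented for
illustration, the function names are hypothetical, and T is taken, as
in Mann's procedure, to be the number of times a later y exceeds an
earlier one.

```python
def mann_whitney_u(sample_a, sample_b):
    """Number of (A, B) pairs in which the A observation is smaller."""
    return sum(1 for a in sample_a for b in sample_b if a < b)


def mann_t(y):
    """Number of times a subsequent y exceeds a given y."""
    n = len(y)
    return sum(1 for i in range(n) for j in range(i + 1, n) if y[j] > y[i])


def t_from_samples(sample_a, sample_b):
    """T with the combined order as x and the sample label as y.

    Label 0 stands for the "tied" A's and label 1 for the "tied" B's.
    """
    combined = [(v, 0) for v in sample_a] + [(v, 1) for v in sample_b]
    labels_in_order = [label for _, label in sorted(combined)]
    return mann_t(labels_in_order)


# Invented, untied data used only to illustrate the identity U = T.
a = [1.2, 3.4, 5.0]
b = [2.2, 4.1, 6.3, 7.7]
print(mann_whitney_u(a, b), t_from_samples(a, b))  # 9 9
```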


The conditional probability of T given that one ranking
contained two sets of ties, one of extent a, the other, b = n - a,
would therefore be identical to the probability of a U of the same
value, U = T, when sample A contained a observations and sample B
contained b observations, no observations being tied. It should be
emphasized, however, that this conditional probability of T is
obtainable from the U tables but not from the tables for T or S which
are derived under the assumption of no tied ranks. That is to say, the
proper tables for S, when there are ties, are those calculated from
the conditional distribution of S given that so many groups of so many
tied observations are present in the data. The situation is analogous
when probabilities are obtained from the normal approximation. In
that case the standard deviation used as the denominator of the
critical ratio must be the square root of the conditional variance of S
given that certain ties have occurred. The formulae for the
"corrected" variance of S may become quite formidable as for instance
in the case that there are ties in both rankings. Therefore, the
relationship between Kendall's S "when there are ties" and other tests
is a somewhat contrived one which is interesting but not particularly
useful in most cases. Generally it will be more efficient and less
confusing to employ tests expressly designed for data classified into
groups or categories rather than to seek out the proper modification
of the inversions test. This is especially true when samples are small
since the exact conditional distribution of S apparently has been
tabled only for the case where there are "ties" in one ranking and only
then for n < 10 (43).

Several tests have been developed which do not belong to the
category discussed above. Whitfield (55) has outlined a test for
intraclass correlation of ranked data and tabled its exact
probabilities. Ranks from 1 to n are assigned to the members of n/2
pairs of observations. The pairs are then arranged in order of the
lowest rank in each pair, i.e., the lowest rank among the remaining
observations not yet ordered. Kendall's S is then calculated in the
usual manner except that no observation is compared with its paired
member (but is compared with all n-2 other observations). $S_{max}$ is
therefore $\frac{n}{2}(n-2)$ and, since $S_{min}$ is zero, the average
S is $\frac{n}{4}(n-2)$. Defining his test statistic as
$S_p = S - \frac{n}{4}(n-2)$, Whitfield tables its probabilities for
$6 \le n \le 20$. He finds its variance to be $(n^3 - 4n)/18$ so that
large sample probabilities can be obtained by referring the critical
ratio to normal tables. Moran (31) has outlined a curvilinear ranking
test in which the integer 1 is moved to the nearest end of the range of
ranks, then the integer 2 to the nearest end of the range of ranks 2 to
n, etc., until all integers have been so treated. The test statistic is
the least number of interchanges of integers required to effect this.
Exact tables have been prepared for $2 \le n \le 14$. Daniels and
Kendall (6) have developed a large sample test for the significance of
the difference between two correlations when the correlations in
the parent populations are not zero. They have also attacked the
problem of establishing confidence limits for a rank correlation
when a nonzero correlation exists in the parent population. Kendall
(21) has established a partial correlation coefficient based on ranks,
but has been unable to test its significance.

h. Tables. Probabilities for S have been tabled for $n \le 10$
by Kendall (23, 24) and for $n \le 40$ by Kaarsemaker and van
Wijngaarden (19). (Because of the linear relation between S and T, the
probability of the T corresponding to S is also the probability of S;
therefore, T tables can also be used to test for rank correlation when
there are no ties. See 3, Inversions as a Test for Linear Trend.)

If there are no ties and if all rank permutations are equally
probable, i.e., if x and y are uncorrelated, the distribution of S
rapidly approaches the normal distribution as n increases (5, 20, 23,
35, 44). Asymptotic normality of S in the null case has also been found
when ties are present in one ranking (52) and, under certain
conditions, when both rankings contain ties (7). When x and y are
correlated, the distribution of S is asymptotically normal under
certain stated conditions (16, 23).


When there are no ties, the distribution of S has mean zero
and variance n(n-1)(2n+5)/18. Therefore when n is too large for the
exact probability tables to be applicable, approximate probabilities can
be obtained by referring the critical ratio

$$\frac{S}{\sqrt{n(n-1)(2n+5)/18}}$$

to normal tables. The approximation can be improved by correcting for
continuity. S is discretely distributed, successive values of S being
two units apart; therefore, a tail area of the S distribution whose
least extreme value is S would be represented, on a continuous curve, by
an S one unit less extreme. The correction for continuity therefore
consists of decreasing the value of S by one unit if S is positive or
increasing it by one unit if it is negative, before calculating the
critical ratio. If ties are present, different continuity corrections
are required depending upon the situation. Some of these corrections
have been given by Kendall (23).
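
The untied normal approximation with the continuity correction just
described might be sketched as follows. The function name is
illustrative, and the normal tail area is obtained from the
complementary error function rather than from tables.

```python
from math import erfc, sqrt


def s_normal_approximation(s, n):
    """Critical ratio and one-tailed probability for S (no ties).

    S is moved one unit toward zero before dividing by its
    standard deviation, as a correction for continuity.
    """
    variance = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        s_corrected = s - 1
    elif s < 0:
        s_corrected = s + 1
    else:
        s_corrected = 0
    z = s_corrected / sqrt(variance)
    one_tailed_p = 0.5 * erfc(abs(z) / sqrt(2))
    return z, one_tailed_p


# The example above (n = 5, S = 8), for which the exact value is 5/120.
z, p = s_normal_approximation(8, 5)
print(round(z, 3), round(p, 4))
```

With n as small as 5 the exact tables would of course be used; the
example serves only to show that the approximation is already close
to the exact one-tailed level of 5/120.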

The conditional variance of S given that certain ties exist and
are assigned the midrank has been given by Kendall and others (23, 27,
43). If only one ranking contains ties, the variance of S is

$$\frac{n(n-1)(2n+5) - \sum t(t-1)(2t+5)}{18}$$

where t is the number of ranks in a tied group, i.e., the number of
observations tied for a given value, the summation being taken over all
such groups (the value of t perhaps varying from group to group). If
both rankings contain ties, the variance of S is

$$\frac{n(n-1)(2n+5) - \sum t(t-1)(2t+5) - \sum \mu(\mu-1)(2\mu+5)}{18}
+ \frac{\left[\sum t(t-1)(t-2)\right]\left[\sum \mu(\mu-1)(\mu-2)\right]}{9n(n-1)(n-2)}
+ \frac{\left[\sum t(t-1)\right]\left[\sum \mu(\mu-1)\right]}{2n(n-1)}$$

where t is defined as above but refers only to the ties in one ranking
and $\mu$ is analogous to t but refers to ties in the other ranking. The
mean S remains zero when there are ties in one or both rankings. When
critical ratios are referred to normal tables, the proper correction for
continuity depends upon the tying situation. The correction for certain
cases has been given by Kendall (23).
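
Since the corrected variance is long, a small helper may reduce
arithmetic errors. The following Python sketch transcribes the two
formulas above; the function and argument names are hypothetical, and
the tied-group sizes are supplied as lists (one entry per tied group,
untied ranks omitted).

```python
def tie_corrected_variance(n, t_groups=(), mu_groups=()):
    """Conditional variance of S given the stated tied groups.

    t_groups: sizes of the tied groups in one ranking.
    mu_groups: sizes of the tied groups in the other ranking.
    With both lists empty this reduces to n(n-1)(2n+5)/18.
    """
    base = n * (n - 1) * (2 * n + 5)
    t_term = sum(t * (t - 1) * (2 * t + 5) for t in t_groups)
    mu_term = sum(m * (m - 1) * (2 * m + 5) for m in mu_groups)
    variance = (base - t_term - mu_term) / 18

    if t_groups and mu_groups and n > 2:
        t3 = sum(t * (t - 1) * (t - 2) for t in t_groups)
        mu3 = sum(m * (m - 1) * (m - 2) for m in mu_groups)
        t2 = sum(t * (t - 1) for t in t_groups)
        mu2 = sum(m * (m - 1) for m in mu_groups)
        variance += t3 * mu3 / (9 * n * (n - 1) * (n - 2))
        variance += t2 * mu2 / (2 * n * (n - 1))
    return variance


# n = 4 with one tied pair in one ranking, then in both rankings.
print(tie_corrected_variance(4))
print(tie_corrected_variance(4, t_groups=[2]))
print(tie_corrected_variance(4, t_groups=[2], mu_groups=[2]))
```

For n = 4 the second and third values, 138/18 and 82/12, agree with a
direct enumeration of the twelve distinguishable arrangements of a
ranking containing one tied pair.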

i. Sources. 1-7, 9-11, 14-28, 31-36, 39, 42-44, 46-48, 50, 55-57.


3.  Inversions  as  a  Test  for  Linear  Trend 


a. Rationale. Suppose that, in Kendall's test for correlation,
the x variable were the time at which a unit "appeared", or was
generated, and the y variable were some quantitative measure on the unit
itself. Kendall's test would then test whether or not the size of the
y measurement is randomly related to the order in which the units
were generated, and it would be particularly likely to reject the
hypothesis of randomness if there were a linear trend in the
generating process.

The test can be applied by following Kendall's procedure,
in which case the test statistic is S and Kendall's tables are the
appropriate ones to use, or by following a slightly different, but
equivalent, procedure outlined by Mann. The observations, i.e.,
y measurements, are arranged in temporal order of appearance, and
the number of times a subsequent measurement exceeds a given y is
counted for each y and the sum obtained for all ys. This sum is
called T. It is simply the complement of the number of inversions
and is related to I and to S in the following manner:
$T = \frac{n}{2}(n-1) - I = S + I$.

b. Null Hypothesis. Each of the n! possible permutations
of order for the size-rank of the ys was equally likely, before
sampling, to result by arranging the ys in the temporal order in which
they were generated. A sufficient condition for the validity of the null
hypothesis is that the size of the y observations is uncorrelated with
the temporal order in which they are generated and all assumptions are
true.

c.  Assumptions.  The  observations  have  been  taken  indepen¬ 
dently  and  at  random  from  a  population  in  which  the  ys  are  continuously 
distributed,  or  exist  naturally  in  the  form  of  untied  ranks,  and  in  which 
ys  are  generated  one  at  a  time. 

d. Treatment of Ties. See 2, Kendall's Rank Order Correlation
Test. If ties are given the midrank, S, rather than T or I, should
be used as the test statistic.

e. Efficiency. When used as tests of randomness against
normal regression alternatives, Mann's T test has asymptotic relative
efficiency of $(3/\pi)^{1/3}$ or .98 relative to the parametric test
based upon the regression coefficient, b (49. See also 45). It is
therefore equal or superior to most other distribution-free tests for
trend. See Table I in Introduction and (45, 49, 13, 37, 38).

The test is consistent and unbiassed (29, 14) under general
conditions stated by Mann (29).
f. Application. If x is time in the example given under
Application for Kendall's test, the time-ordered y values are 7, 4, 16,
53, 84 for which T = 9 and S = 8. Significance levels may therefore be
obtained either by using Mann's probability tables for T or Kendall's
for S.
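
The three related statistics can be obtained in one pass over the
time-ordered observations. The sketch below is illustrative only (the
function name is hypothetical) and assumes untied observations.

```python
def trend_statistics(y):
    """Return (T, I, S) for observations in temporal order.

    T counts later values that exceed an earlier one, I counts
    inversions (later values that are smaller), and S = T - I.
    """
    n = len(y)
    t = sum(1 for i in range(n) for j in range(i + 1, n) if y[j] > y[i])
    inversions = sum(1 for i in range(n) for j in range(i + 1, n) if y[j] < y[i])
    return t, inversions, t - inversions


y_in_time_order = [7, 4, 16, 53, 84]
t_stat, i_stat, s_stat = trend_statistics(y_in_time_order)
print(t_stat, i_stat, s_stat)  # 9 1 8
```

With n = 5 these values satisfy the relation given under Rationale:
T equals n(n-1)/2 minus I, which is 10 - 1, and also S plus I, which
is 8 + 1.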


g. Discussion. See 2, Kendall's Rank Order Correlation Test.


Elfving and Whitlock (12) have proposed a test for trend which
is equivalent to T pooled over several sets of observations. The test
statistic is equivalent to the sum of r Ts, where r is the number of
sets of observations. Its mean and variance are the respective sums
of the means and variances for the individual sets. Thus, in effect,
the test is carried out by referring to normal tables a critical ratio
whose numerator is $\sum \left[ T_i - \frac{n(n-1)}{4} \right]$ and
whose denominator is the square root of
$\sum \frac{2n^3 + 3n^2 - 5n}{72}$, n referring to the number of
observations in a set.

h. Tables. Mann (29) has tabled the exact T probabilities
for $3 \le n \le 10$. By using S instead of T, Kendall's tables (23,
24), or the exact tables of Kaarsemaker and van Wijngaarden (19) can be
used, the latter yielding exact probabilities for n's up to 40.

The distribution of T has mean $\frac{n}{4}(n-1)$ and variance
$\frac{2n^3 + 3n^2 - 5n}{72}$ and approaches a normal distribution as n
approaches infinity (29, 40, 54). Therefore when n is too large for the
exact tables to apply, approximate probabilities for T may be obtained
by referring the critical ratio

$$\frac{T - \frac{n}{4}(n-1)}{\sqrt{\dfrac{2n^3 + 3n^2 - 5n}{72}}}$$

to normal tables. (To correct for continuity, positive numerators should
be decreased, and negative numerators increased, by 1/2.) However, if
ties exist and are given the midrank, neither the T tables nor the
normal approximation to T should be used. Instead, the test should be
carried out using Kendall's S as the test statistic.
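
When there are no ties, the critical ratio for T and that for S lead
to the same normal deviate, since S = 2T - n(n-1)/2. The following
sketch, with hypothetical function names, computes both ratios with
their respective continuity corrections so that the agreement can be
seen numerically.

```python
from math import sqrt


def toward_zero(value, amount):
    """Move value toward zero by amount (continuity correction)."""
    if value > 0:
        return value - amount
    if value < 0:
        return value + amount
    return value


def t_critical_ratio(t, n):
    """Continuity-corrected critical ratio for Mann's T (no ties)."""
    mean = n * (n - 1) / 4
    variance = (2 * n ** 3 + 3 * n ** 2 - 5 * n) / 72
    return toward_zero(t - mean, 0.5) / sqrt(variance)


def s_critical_ratio(s, n):
    """Continuity-corrected critical ratio for Kendall's S (no ties)."""
    variance = n * (n - 1) * (2 * n + 5) / 18
    return toward_zero(s, 1) / sqrt(variance)


# Example used throughout this section: n = 5, T = 9, S = 8.
print(t_critical_ratio(9, 5), s_critical_ratio(8, 5))
```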
i. Sources. 4, 8, 12, 13, 14, 19, 23, 24, 29, 30, 37, 38,
40, 41, 43, 45, 49, 52, 53, 54.


4.  Mann’s  K-Test 


a. Rationale. Mann (29) finds that "If $P(X_i > X_j)$ increases
rapidly with j-i, then another test is more powerful than the T-test."
This test, the K-test, consists of arranging the observations in their
order of appearance, $X_0, X_1, X_2, \ldots, X_{n-1}$, and finding "the
smallest value of K for which the following set of inequalities is
fulfilled:

$$X_0 > X_K, \quad X_0 > X_{K+1}, \quad \ldots, \quad X_0 > X_{n-1},$$
$$X_1 > X_{K+1}, \quad \ldots, \quad X_1 > X_{n-1},$$
$$\ldots$$
$$X_{n-K-1} > X_{n-1}."$$

The probability that for n untied observations K will be some specified
integer $K_0$ is simply the number of permutations in which $K = K_0$
divided by n!, the number of possible permutations of the n
observations.


b.  Null  Hypothesis.  See  3,  Inversions  as  a  Test  for  Linear 

Trend. 

c.  Assumptions.  See  3. 

d. Treatment of Ties. Make no compromise in interpreting
the inequality sign (see above) when determining K. The probabilities
thus obtained will err in the conservative direction, i.e., rejection
will be less likely than if there were no ties.


e. Efficiency. Mann states that when $\Pr(X_i > X_j)$ increases
rapidly with j-i, the K-test is more powerful than the T-test. He
notes "that the K-test is most powerful with respect to a fairly wide
class of alternatives".
f. Application. In the example given for the T-test, the
time-ordered observations are 7, 4, 16, 53, 84. There is obviously no
downward trend, which the K-test is designed to detect. However, the
presence of an upward trend may be tested by reversing the signs of
the inequalities given under Rationale, and proceeding in the manner
outlined for downward trend. The 7 is exceeded by all observations
from the 16 on, the 4 is exceeded by each of the two observations
following the 16, and the 16 is exceeded by the 84. Therefore K is
the subscript which goes with 16, which is the third observation in
order, therefore having the subscript 2, since subscripts start with
zero. So K = 2 which for n = 5 has a tabled probability of .0667.
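
Because K is easily mistaken by hand, a mechanical determination may
be helpful. The sketch below is an illustration under the definition
given under Rationale (subscripts starting at zero, strict
inequalities); the function names and the `upward` switch are
hypothetical, and the exact probability is found by enumerating all
permutations, which is feasible only for small n.

```python
from itertools import permutations


def mann_k(obs, upward=False):
    """Smallest K such that X_i > X_j whenever j - i >= K.

    With upward=True the inequalities are reversed. Returns None
    when no K from 1 to n-1 satisfies the inequalities.
    """
    n = len(obs)
    for k in range(1, n):
        if all(
            (obs[i] < obs[j]) if upward else (obs[i] > obs[j])
            for i in range(n - k)
            for j in range(i + k, n)
        ):
            return k
    return None


def prob_k_at_most(n, k_limit):
    """Exact null probability that K exists and does not exceed k_limit."""
    perms = list(permutations(range(1, n + 1)))
    hits = 0
    for p in perms:
        k = mann_k(p)
        if k is not None and k <= k_limit:
            hits += 1
    return hits / len(perms)


print(mann_k([7, 4, 16, 53, 84], upward=True))  # 2
print(round(prob_k_at_most(5, 2), 4))  # 0.0667
```

The enumerated probability for n = 5 is 8/120, in agreement with the
tabled value quoted above.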

g. Discussion. This test has two outstanding disadvantages.
First, it is easy to make errors in determining K. The determination
of K involves examining several possibilities in order to pick the
smallest K satisfying a rather involved set of inequalities. And the
subscript notation is a confusing one since K is one unit less than the
positional rank of the observation to which it refers. Furthermore,
for certain order permutations there is no value of K which satisfies
the inequalities. (Zero cannot be used to designate K in this situation
because zero refers to the first observation in order of appearance.)
Second, the K-test apparently is restricted to one-tailed tests of
hypotheses. K is not symmetrically distributed, so the two-tailed
probability cannot be obtained by doubling the one-tailed probability.
And the value of K, for a test of downward trend, though different from
the value it takes in testing for upward trend, presumably is not
entirely independent of it. So, if the presumption is correct, two-tail
probabilities cannot be obtained by combining probabilities from two
opposite one-tailed tests. The following table serves to illustrate
these points.

h. Tables. Mann (29) has tabled the probability that K is less than
or equal to a specified value. Actually these tables will suffice for
almost any practical application regardless of the value of n. For
n = 10 and K = 5, the first five observations are compared with the last
five (i.e., the following comparisons are made: $X_0$ with $X_5$, $X_6$,
$X_7$, $X_8$, $X_9$; $X_1$ with $X_6$, $X_7$, $X_8$, $X_9$; $X_2$ with
$X_7$, $X_8$, $X_9$; $X_3$ with $X_8$, $X_9$; and $X_4$ with $X_9$).
Since under the null hypothesis the observations are randomly arranged
in order, for n > 10 the test may be arbitrarily applied to only the ten
observations consisting of the first five and the last five in the
series. When n = 10, the probability that $K \le 5$ is .0098. In this
case, this is also the probability that $K \le n - 5$.

When n is greater than 10 and $4 < K \le n - 5$, if the set of
inequalities holds for K, it will also hold for a K of n - 5 when the
set of observations is reduced in size to include only the first five
and last five observations. The probability of the latter will be
greater than that of the former, but the increase will be from some
value smaller than .0098 to .0098, thus still being beyond the .01 level
of significance. Therefore, for practical purposes only the first five
and last five observations are necessary to conduct a reasonable test
of significance.

i.  Sources.  29 
TABLE X


TABULATION OF THE DISTRIBUTION OF K WHEN n = 4


Size-Rank Permutation     K = 1    K = 2    K = 3    K = Nothing

     1  2  3  4             X
     1  2  4  3                      X
     1  3  2  4                      X
     1  3  4  2                               X
     1  4  2  3                               X
     1  4  3  2                               X
     2  1  3  4                      X
     2  1  4  3                      X
     2  3  1  4                               X
     2  3  4  1                                           X
     2  4  1  3                               X
     2  4  3  1                                           X
     3  1  2  4                               X
     3  1  4  2                                           X
     3  2  1  4                               X
     3  2  4  1                                           X
     3  4  1  2                                           X
     3  4  2  1                                           X
     4  1  2  3                                           X
     4  1  3  2                                           X
     4  2  1  3                                           X
     4  2  3  1                                           X
     4  3  1  2                                           X
     4  3  2  1                                           X

Point Probability         .042     .167     .292        .500
Cumulative Probability    .042     .209     .500       1.000
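
The tabulation above can be checked by direct enumeration. The following is a minimal sketch in Python, not part of the original report. It assumes that K is the smallest k for which every observation is smaller than all observations k or more places later in the series, a definition inferred here from the comparisons listed for n = 10 and K = 5; the function names `mann_k` and `k_distribution` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def mann_k(series):
    """Smallest k with series[i] < series[j] whenever j - i >= k, else None.

    Assumed definition of K (inferred from the text, not quoted from it);
    None stands for the "K = Nothing" column of the table.
    """
    n = len(series)
    for k in range(1, n):
        if all(series[i] < series[j]
               for i in range(n) for j in range(i + k, n)):
            return k
    return None


def k_distribution(n):
    """Point probability of each value of K over all n! equally likely orders."""
    counts = {}
    for perm in permutations(range(1, n + 1)):
        k = mann_k(perm)
        counts[k] = counts.get(k, 0) + 1
    total = sum(counts.values())
    return {k: Fraction(c, total) for k, c in counts.items()}
```

Under that assumed definition, `k_distribution(4)` assigns 1, 4, 7 and 12 of the 24 permutations to K = 1, 2, 3 and "Nothing", which agrees with the point probabilities .042, .167, .292 and .500 in the table.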
CHAPTER VIII


RUNS OF CONSTANT PROBABILITY EVENTS


In a series of two kinds of events, a and b, although the
proportionate number of a's and b's will necessarily depend upon
the ratio of their individual constant probabilities of occurrence,
the pattern in which the obtained a's and b's arrange themselves
will not, and will be random unless a's and b's are sequentially
dependent. In that case like events may tend to cluster, and this
may be indicated by an unusually small number of runs, or clusters
of like objects, in the pattern, or by runs of unexpected length. Thus
the total number of runs, the length of the longest run, and various
other run statistics can be used as the sample information with which
to test for randomness of pattern of arrangement against the
alternative of sequential dependency. By judicious definition of the two
types of event, this test can be employed to test whether two sampled
populations are identical, whether a trend exists in a sequentially
sampled population, whether learning is taking place, etc. Run
tests are often rather weak and inefficient, depending upon the type
of application contemplated. However, their power may be greatly
increased by introducing certain modifications (such as Ramachandran
and Ranganathan's) or by combining the run test with an independent
test (as in David's Chi-square Smooth test of goodness of fit).
1. Basic Formulae


A run is an unbroken sequence of similar events or like
objects. For example in the series aababbbaa there are
five runs: one run of a's of length 1, two runs of a's of length 2,
one run of b's of length 1 and one run of b's of length 3. The
following notation will be used in the derivation of run formulae
when there are two kinds of objects. Let $r_{ij}$ be the number of runs
of objects of type i whose length is j and let $r_i$ be the number of runs
of objects of type i irrespective of length, i.e. of all lengths. Let
$n_i$ be the number of objects of type i and let n be the number of
objects of both types. The two types of objects will be designated 1 and
2 respectively. The only things which can interrupt or terminate a
run of like objects are a run of the other type objects or else
termination of the entire series. Therefore $r_1$, the number of runs of
1's, can either be one greater than, equal to, or one less than $r_2$, the
number of runs of 2's. When $r_1 = r_2 + 1$ the series can begin (and end)
in only one way - with a run of 1's. Likewise when $r_1 = r_2 - 1$ the
series must begin and end with runs of 2's. However, when $r_1 = r_2$
the series can either begin with a run of 1's and end with a run of
2's, or begin with a run of 2's and end with a run of 1's. Therefore
it will be convenient to introduce the notation

$$F(r_1, r_2) = \begin{cases} 1 & \text{if } r_1 \neq r_2, \\ 2 & \text{if } r_1 = r_2. \end{cases}$$
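
The notation just introduced can be made concrete with a short sketch in Python (not part of the original report). The helper names `run_lengths`, `run_counts` and `F` are illustrative choices; `run_counts` returns, for each type of object i, the number of runs $r_{ij}$ at each length j.

```python
from itertools import groupby


def run_lengths(series):
    """Return the runs of a series as (symbol, length) pairs, in order."""
    return [(symbol, len(list(group))) for symbol, group in groupby(series)]


def run_counts(series):
    """Tally r[symbol][length]: number of runs of each symbol at each length."""
    tally = {}
    for symbol, length in run_lengths(series):
        by_length = tally.setdefault(symbol, {})
        by_length[length] = by_length.get(length, 0) + 1
    return tally


def F(r1, r2):
    """Number of ways the series can begin: 2 if r1 == r2, otherwise 1."""
    return 2 if r1 == r2 else 1
```

For the series aababbbaa of the text, `run_lengths` yields five runs and `run_counts` reports one run of a's of length 1, two of length 2, and one run of b's each of lengths 1 and 3.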

The $r_1$ runs of 1's of various lengths can be permuted in
$r_1!$ ways. But a permutation which merely exchanges the positions
of runs of 1's of the same length does not change the appearance of
the series. The $r_{1j}$ runs of 1's of length j can be permuted in
$r_{1j}!$ ways without changing the appearance of the series. Therefore,
the number of distinguishable permutations of the $r_1$ runs of 1's is

$$\frac{r_1!}{r_{11}!\, r_{12}! \cdots r_{1n_1}!}.$$

For each of these distinguishable permutations of runs of 1's, there are

$$\frac{r_2!}{r_{21}!\, r_{22}! \cdots r_{2n_2}!}$$

distinguishable permutations of the runs of 2's. And since, if
$r_1 = r_2$ the series can begin in two ways, the total number of
distinguishable permutations given that there are $r_1$ runs of 1's
and $r_2$ runs of 2's of specified lengths is

$$\frac{r_1!}{r_{11}!\, r_{12}! \cdots r_{1n_1}!} \cdot \frac{r_2!}{r_{21}!\, r_{22}! \cdots r_{2n_2}!} \cdot F(r_1, r_2).$$

Finally, since there are $\frac{n!}{n_1!\, n_2!}$ distinguishable
permutations of $n_1$ 1's and $n_2$ 2's, the probability that there will
be exactly $r_{11}$ runs of 1's of length 1, $r_{12}$ of length 2, etc.,
as well as exactly $r_{21}$ runs of 2's of length 1, $r_{22}$ of length 2,
etc., given that there are $n_1$ 1's and $n_2$ 2's in the series is

$$\Pr(r_{ij}) = \frac{\dfrac{r_1!}{r_{11}!\, r_{12}! \cdots r_{1n_1}!} \cdot \dfrac{r_2!}{r_{21}!\, r_{22}! \cdots r_{2n_2}!} \cdot F(r_1, r_2)}{\dfrac{n!}{n_1!\, n_2!}}.$$

Suppose that we are interested in the breakdown of runs
of 1's according to length, but that we are not interested in the
corresponding breakdown of runs of 2's. Considering only the 1's,
there are

$$\frac{r_1!}{r_{11}!\, r_{12}! \cdots r_{1n_1}!}$$

distinguishable permutations of the runs of 1's. Now imagine the
$n_2$ 2's arranged in a line. There are $n_2 - 1$ spaces between 2's,
and the $r_2$ runs of 2's can be obtained by selecting $r_2 - 1$ of
these $n_2 - 1$ spaces and "widening" them for occupation by runs of
1's. This can be done in $\binom{n_2 - 1}{r_2 - 1}$ ways. If
$r_1 = r_2 - 1$, then any given permutation of runs of 1's can
be fitted into a specified $r_2 - 1$ spaces between 2's in only one way,
since the series must start and end with a run of 2's. If $r_1 = r_2 + 1$,
in addition to the $r_2 - 1$ spaces between 2's the runs of 1's also
occupy the space to the left of the leftmost 2 and to the right of the
rightmost 2. The series starts and ends with runs of 1's, and
the $r_2 - 1$ spaces between 2-runs are occupied by the second to the
$(r_1 - 1)$st 1-run. Again, this can be accomplished in only one way.
However, if $r_1 = r_2$ the first 1-run can be placed either to the left
of the leftmost 2-run, or between the first and second 2-runs.
Therefore the probability of exactly $r_{11}$ runs of 1's of length 1,
$r_{12}$ of length 2, etc., and $r_2$ runs of 2's of any lengths given that
there are $n_1$ 1's and $n_2$ 2's is

$$\Pr(r_{1j}, r_2) = \frac{\dfrac{r_1!}{r_{11}!\, r_{12}! \cdots r_{1n_1}!} \dbinom{n_2 - 1}{r_2 - 1} F(r_1, r_2)}{\dfrac{n!}{n_1!\, n_2!}}.$$

Suppose now that we are interested in neither the lengths
nor the total number of runs of 2's. The $r_1$ runs of 1's can be
inserted into any $r_1$ of the $n_2 + 1$ spaces before, between, and after
the 2's, i.e., into any of the $n_2 - 1$ spaces between 2's as well as the
space to the left of the leftmost 2 and the space to the right of the
rightmost 2. This can be done in $\binom{n_2 + 1}{r_1}$ ways. Therefore,
(the rest of the derivation being analogous to that given earlier) the
probability of exactly $r_{11}$ runs of 1's of length 1, $r_{12}$ of length
2, etc., given that there are $n_1$ 1's and $n_2$ 2's is

$$\Pr(r_{1j}) = \frac{\dfrac{r_1!}{r_{11}!\, r_{12}! \cdots r_{1n_1}!} \dbinom{n_2 + 1}{r_1}}{\dfrac{n!}{n_1!\, n_2!}}.$$

Since the number of runs of 2's is unspecified, it may be
$r_1 - 1$, $r_1$, or $r_1 + 1$, and the term $F(r_1, r_2)$ is not required
in the formula.


The preceding formulae give probabilities for the entire
run pattern in the sense that the exact number of runs of each
possible length is specified, at least for runs of one type. In order to
obtain the more general probability for only certain specified $r_{ij}$,
one fixes these $r_{ij}$ as constants and sums the formula over all other
values for which the relationships

$$\sum_{j=1}^{n_i} j\, r_{ij} = n_i \quad \text{and} \quad \sum_{j} r_{ij} = r_i$$

are satisfied. For example if $n_1 = 7$ and $n_2 = 9$ the probability of
exactly one run of 1s of length 4 would be

$$\Pr(r_{14} = 1) = \sum \frac{r_1!}{r_{11}!\, r_{12}!\, r_{13}!\, 1!} \binom{10}{r_1} \Big/ \frac{16!}{7!\, 9!}
= \frac{\dfrac{4!}{3!\, 1!}\dbinom{10}{4} + \dfrac{3!}{1!\, 1!\, 1!}\dbinom{10}{3} + \dfrac{2!}{1!\, 1!}\dbinom{10}{2}}{\dfrac{16!}{7!\, 9!}}$$

since a run of length 4 could be accompanied by three runs of length
1, one of length 1 and one of length 2, or by one of length 3 while
still fulfilling the condition that $n_1 = 7$ and since the number of runs
of 1s in these three cases is 4, 3, and 2 respectively.
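
The worked example can be evaluated numerically. The sketch below (Python, not part of the original report) sums the formula over the admissible run patterns and, as an independent check, counts the qualifying arrangements among all 11,440 placements of seven 1s and nine 2s. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, groupby
from math import comb, factorial


def partitions(total, largest):
    """Yield partitions of `total` into parts no larger than `largest`."""
    if total == 0:
        yield []
        return
    for part in range(min(total, largest), 0, -1):
        for rest in partitions(total - part, part):
            yield [part] + rest


def prob_exactly(n1, n2, length, count):
    """Pr(exactly `count` runs of 1s have the given length), by the formula."""
    ways = 0
    for parts in partitions(n1, n1):
        if parts.count(length) != count:
            continue
        r1 = len(parts)
        # r1! / (r_11! r_12! ...): distinguishable orderings of the runs
        orderings = factorial(r1)
        for size in set(parts):
            orderings //= factorial(parts.count(size))
        # choose which of the n2 + 1 spaces around the 2s receive the runs
        ways += orderings * comb(n2 + 1, r1)
    return Fraction(ways, comb(n1 + n2, n1))


def prob_by_enumeration(n1, n2, length, count):
    """Same probability by listing every arrangement of n1 1s and n2 2s."""
    n = n1 + n2
    hits = 0
    for ones in combinations(range(n), n1):
        chosen = set(ones)
        series = [1 if i in chosen else 2 for i in range(n)]
        lengths = [len(list(g)) for sym, g in groupby(series) if sym == 1]
        hits += lengths.count(length) == count
    return Fraction(hits, comb(n, n1))
```

Both `prob_exactly(7, 9, 4, 1)` and `prob_by_enumeration(7, 9, 4, 1)` give 1650/11440, or about .144.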


Now suppose that we are interested in number of runs
only, and not in their lengths. Imagine the $n_1$ 1s arranged in a
line. There are $n_1 - 1$ spaces between 1s and the 1s can be
separated into $r_1$ runs by selecting and "widening" $r_1 - 1$ of these
spaces, then filling them with runs of 2s. The $r_1 - 1$ spaces can be
selected in $\binom{n_1 - 1}{r_1 - 1}$ ways. For each of these ways the
$r_2$ runs of 2s (which will eventually be interlaced with the 1s) can,
by analogous reasoning, be selected in $\binom{n_2 - 1}{r_2 - 1}$ ways.
Any given set of $r_1$ runs of 1s and $r_2$ runs of 2s can be fitted
together in one way if $r_1 = r_2 \pm 1$ and in two ways if $r_1 = r_2$.
The number of distinguishable permutations of $r_1$ runs of 1s and $r_2$
runs of 2s is therefore
$\binom{n_1 - 1}{r_1 - 1}\binom{n_2 - 1}{r_2 - 1} F(r_1, r_2)$. The number
of distinguishable permutations of $n_1$ 1s and $n_2$ 2s without
restriction as to numbers of runs is $\frac{n!}{n_1!\, n_2!}$. Therefore
the probability of exactly $r_1$ runs of 1s and $r_2$ runs of 2s given
that there are $n_1$ 1s and $n_2$ 2s is

$$\Pr(r_1, r_2) = \frac{\dbinom{n_1 - 1}{r_1 - 1}\dbinom{n_2 - 1}{r_2 - 1} F(r_1, r_2)}{\dfrac{n!}{n_1!\, n_2!}}.$$


If we are interested only in the number of runs of 1s and
are indifferent to whether $r_2$ equals $r_1 - 1$, $r_1$, or $r_1 + 1$, we
still select the $r_1$ runs of 1s by selecting $r_1 - 1$ of the $n_1 - 1$
spaces between 1s for widening. However, now the spaces before and after
the 2s as well as the spaces between 2s are available for occupation
by 1s because the number of runs of 2s is not fixed. Therefore
there are $n_2 + 1$ spaces available for occupation by the $r_1$ runs, and
they can be chosen in $\binom{n_2 + 1}{r_1}$ ways. The rest of the
derivation is analogous to that described earlier. Therefore, the
probability that there will be exactly $r_1$ runs of 1s given that there
are $n_1$ 1s and $n_2$ 2s is

$$\Pr(r_1) = \frac{\dbinom{n_1 - 1}{r_1 - 1}\dbinom{n_2 + 1}{r_1}}{\dfrac{n!}{n_1!\, n_2!}}.$$
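
The two formulae for numbers of runs should agree with one another: summing $\Pr(r_1, r_2)$ over the admissible values of $r_2$ should reproduce $\Pr(r_1)$. The following Python sketch (not part of the original report; function names are illustrative, and $n_1, n_2 \geq 1$ is assumed) computes both as exact fractions so that the agreement can be checked.

```python
from fractions import Fraction
from math import comb


def pr_r1_r2(r1, r2, n1, n2):
    """Pr(exactly r1 runs of 1s and r2 runs of 2s | n1 1s and n2 2s)."""
    if r1 < 1 or r2 < 1 or abs(r1 - r2) > 1:
        return Fraction(0)
    f = 2 if r1 == r2 else 1  # F(r1, r2): two possible beginnings when equal
    ways = comb(n1 - 1, r1 - 1) * comb(n2 - 1, r2 - 1) * f
    return Fraction(ways, comb(n1 + n2, n1))


def pr_r1(r1, n1, n2):
    """Pr(exactly r1 runs of 1s | n1 1s and n2 2s), runs of 2s unspecified."""
    if r1 < 1:
        return Fraction(0)
    return Fraction(comb(n1 - 1, r1 - 1) * comb(n2 + 1, r1),
                    comb(n1 + n2, n1))


def marginal_from_joint(r1, n1, n2):
    """Sum the joint formula over r2 = r1 - 1, r1, r1 + 1."""
    return sum((pr_r1_r2(r1, r2, n1, n2) for r2 in (r1 - 1, r1, r1 + 1)),
               Fraction(0))
```

The agreement rests on the identity $\binom{n_2-1}{r_1-2} + 2\binom{n_2-1}{r_1-1} + \binom{n_2-1}{r_1} = \binom{n_2+1}{r_1}$, which is why $F(r_1, r_2)$ drops out of the marginal formula.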


All of the run formulae heretofore listed take $n_1$ and $n_2$ as
given. They give probabilities conditional upon the existence of
exactly $n_1$ 1s and $n_2$ 2s in the obtained sample. If one is interested
in the arrangement of 1s and 2s but not in the probability of
obtaining a 1 or a 2, the foregoing formulae are generally the appropriate
ones. However if "1" and "2" are mutually exclusive outcomes of
a binomial event, with probabilities p and q respectively of
occurrence on a single trial, the experimenter may be interested in the
compound probability that there will be $n_1$ 1s and $n_2$ 2s and that
their arrangement will contain a specified configuration of runs.
This compound probability is obtained by taking the product of the
binomial probability, $\binom{n}{n_1} p^{n_1} q^{n_2}$, and whichever one
of the probability formulae listed earlier gives the appropriate
conditional probability for the specified configuration of runs.

The various formulae given above could be used as the
bases for a variety of statistical tests of the hypothesis that 1s
and 2s are arranged randomly. The particular formula used
would depend upon the conditions taken as given and upon the
alternative hypothesis against which one wished the test to be most
sensitive. However, although a multiplicity of such tests are possible,
calculations of probabilities generally become quite involved at any
but the smallest sample sizes. Therefore, in the following sections
only those tests will be described for which probabilities have been
tabled.


2. The Wald-Wolfowitz Total Number of Runs Test


a. Rationale. Suppose that two samples have been drawn
(randomly and independently), each from a continuously distributed
population, and that one wishes to test whether or not the parent
populations are identical. Let the sizes of the two samples be m
and n and let their observations be designated as xs and ys
respectively. Now arrange the m+n observations in increasing order of
magnitude irrespective of the sample to which an observation
originally belonged. Finally, label each such observation x or y
depending upon the sample from which it came. If the two samples
came from identical populations, then the pattern of arrangement
of xs and ys is a random one since x and y are arbitrary labels
attached to observations drawn randomly and independently from the
"same" population. However, if the samples are from different
populations, one would expect observations from the same sample
to tend to cluster; so the total number of runs should tend to be
less than the number expected on a purely chance basis.

Let U stand for the total number of runs of both xs and
ys. Since the number of runs of xs can be one less than, equal
to, or one greater than the number of runs of ys, U can be an even
number in only one way, but can be an odd number in two ways.
Substituting into the formula

$$\Pr(r_1, r_2) = \binom{m-1}{r_1-1}\binom{n-1}{r_2-1} F(r_1, r_2) \Big/ \binom{m+n}{m},$$

if $r_1 = r_2 = r$,

$$\Pr(r_1, r_2) = 2\binom{m-1}{r-1}\binom{n-1}{r-1} \Big/ \binom{m+n}{m},$$

if $r_1 = r$ and $r_2 = r + 1$,

$$\Pr(r_1, r_2) = \binom{m-1}{r-1}\binom{n-1}{r} \Big/ \binom{m+n}{m},$$

and if $r_1 = r + 1$ and $r_2 = r$,

$$\Pr(r_1, r_2) = \binom{m-1}{r}\binom{n-1}{r-1} \Big/ \binom{m+n}{m}.$$

Therefore, the probability that the total number of runs will be
some even number, 2r, is

$$\Pr(U = 2r) = 2\binom{m-1}{r-1}\binom{n-1}{r-1} \Big/ \binom{m+n}{m}$$

and the probability that it will be some odd number, 2r+1, is

$$\Pr(U = 2r+1) = \left[\binom{m-1}{r-1}\binom{n-1}{r} + \binom{m-1}{r}\binom{n-1}{r-1}\right] \Big/ \binom{m+n}{m}.$$

b. Null Hypothesis. Given that there are m xs and n ys,
each of the $\binom{m+n}{m}$ distinguishable arrangements of xs and ys was
equally likely to have been the arrangement actually obtained. A
sufficient condition for the validity of the null hypothesis is that the
x observations and y observations were drawn from identical
populations and that all assumptions are true.


c. Assumptions. For each sample the observations
were drawn randomly and independently from a continuously
distributed population.


d. Treatment of Ties. Ties are a problem only when
observations from both samples are tied for the same position,
or rank, in order of increasing magnitude. In many, but not
all, such cases the resolution of ties can affect the total number
of runs. A tie for which the total number of runs varies
depending on how the tie is broken will be called a "critical" tie.

For a conservative test critical ties should be resolved in the
manner least conducive to rejection of the null hypothesis.
However, if one wishes to minimize the average error in probabilities,
the following method of dealing with critical ties may be pursued.
For tied groups consisting of a single x and a single y, randomly
select one-half of the groups and resolve ties so that the x
precedes the y with which it is tied; for the remaining half, resolve
ties so that the y precedes the x; if an odd group remains, resolve
the tie by flipping a coin. For tied groups consisting of a single
x and two ys, resolve ties so that for a randomly selected 1/3 of
these groups the order is xyy, for another randomly selected 1/3
it is yxy, and for the remaining 1/3 it is yyx, any remaining groups
being resolved by randomly selecting one of the orders xyy, yxy,
yyx, a different randomly-selected order being used for each such
group. To generalize: if there are k groups in which s xs and t
ys are tied with one another, resolve ties by successively selecting
$\binom{s+t}{s}$ of the k groups and replacing each of them with a
different, randomly assigned one of the $\binom{s+t}{s}$ distinguishable
orderings of s xs and t ys; if k is not divisible by $\binom{s+t}{s}$
resolve ties in the remaining groups by randomly assigning each of them
a different one of the $\binom{s+t}{s}$ possible orderings.


e. Efficiency. When applied to symmetrical populations
known to be equal in all respects except for location, a test for
identical populations is equivalent to a test for equal means. When
both tests are applied to samples from normally distributed
populations with equal variances, the Wald-Wolfowitz form of the run
test has relative to Student's t-test an asymptotic relative efficiency
of zero (33; see also qualifications stated in 30, 33) and a small sample
efficiency which, when each sample contains five or fewer
observations, generally exceeds .96 and may be as high as .995 (13). It
also has an A.R.E. of zero relative to the F ratio when applied
to normal populations as a test for dispersion (33). The test
compares poorly with other distribution-free tests (see Table I in
Introduction). It had the least power of the tests investigated by van
der Waerden (47), Epstein (14), and Lehmann (30), the former two
authors sampling from normal populations with homogeneous
variances, the latter sampling from a continuously distributed
population. It was found by one or more of these authors to be inferior
in power to the following tests: Student's t, van der Waerden's X-test,
Lehmann's most powerful test, Mann-Whitney test, Westenberg's
Median test, Epstein's exceedances test, Smirnov's maximum
deviation test. The Wald-Wolfowitz test is consistent if the ratio m/n
of sample sizes remains constant as sample sizes m and n approach
infinity and if certain other very mild conditions are met (48, 29).

If the ratio m/n does not remain constant, but approaches zero
or infinity, the test is inconsistent. That is to say, if one sample
is of much greater size than the other, observations from the
sample of smaller size are almost certain to be separated from each
other by observations from the larger sample; thus, the number of
runs will tend to be a maximum regardless of whether the null
hypothesis is true or false (29).

The power function for Stevens' form of the run test has
been obtained against the alternative of a Markov chain by David
(10).


f. Application. Suppose that a sample of observations
has been taken on randomly selected and assigned subjects under
each of two treatments and that it is desired to test whether the
two treatments differ in any measured respect. The data are
shown below.

Treatment x     5   14   23   61  114  125  131

Treatment y    47   55   64   66   71

If the data are arranged in order of increasing magnitude with the
sample from which each observation came listed below it, we have:

  5   14   23   47   55   61   64   66   71  114  125  131

  x    x    x    y    y    x    y    y    y    x    x    x

There are three runs of xs and two of ys, so U = 5. Entering the
probability tables for runs with m = 7, n = 5, and taking the
smallest numbers of runs as the rejection region, we find that the largest
value of U significant at the one-tailed .05 level of significance is 3.
Since U = 5 in the above data, the hypothesis of identical populations,
and therefore equal treatment effects, cannot be rejected at the
significance level chosen. Since a casual inspection of the data
strongly suggests that the populations have unequal variances, the
above example serves to illustrate the weakness of the test.
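
The application can be reproduced without tables by taking the null hypothesis literally: every placement of the seven xs among the twelve ordered positions is equally likely, so the tail probability is simply a count. The following Python sketch is not part of the original report; function names are illustrative, and it assumes no ties between the two samples.

```python
from fractions import Fraction
from itertools import combinations, groupby


def total_runs(labels):
    """Total number of runs U in a sequence of labels."""
    return sum(1 for _ in groupby(labels))


def labels_in_order(xs, ys):
    """Pool two samples, sort by magnitude, and return the x/y labels.

    Assumes no value occurs in both samples (no critical ties).
    """
    pooled = [(value, "x") for value in xs] + [(value, "y") for value in ys]
    return [label for _, label in sorted(pooled)]


def lower_tail_by_enumeration(u_obs, m, n):
    """Pr(U <= u_obs) by listing every placement of the m xs in m + n slots."""
    hits = total = 0
    for x_slots in combinations(range(m + n), m):
        chosen = set(x_slots)
        labels = ["x" if i in chosen else "y" for i in range(m + n)]
        hits += total_runs(labels) <= u_obs
        total += 1
    return Fraction(hits, total)


x = [5, 14, 23, 61, 114, 125, 131]
y = [47, 55, 64, 66, 71]
labels = labels_in_order(x, y)
u = total_runs(labels)
p = lower_tail_by_enumeration(u, len(x), len(y))
```

For these data the sketch gives U = 5 and a lower-tail probability of 156/792, about .197. The same enumeration gives about .015 for U ≤ 3 and about .076 for U ≤ 4, consistent with 3 being the largest value significant at the one-tailed .05 level.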


g. Discussion. The total number of runs can be used as
a test statistic in ways other than that described for the
Wald-Wolfowitz form of the test. Actually the total number of runs is an
appropriate test whenever one is interested in the randomness of
arrangement of mutually exclusive events, fixed in number, and
constituting a dichotomy. It can be used as a test for trend by
labeling observations above and below the median as x and y
respectively; if there is a linear trend, the number of runs should be
smaller than that expected by chance. It can be used (19, 7) to
test the randomness of wet and dry days in order of appearance;
or to test whether occupied seats at a lunch counter tend to occur
in isolation, bordered by vacant seats (15). In such cases the null
hypothesis is simply that given m xs and n ys each of the
$\binom{m+n}{m}$ distinguishable arrangements is equally probable. The
assumptions are that there are only two mutually exclusive and
unconfusable categories and that sampling is random and independent.
The efficiencies found for the Wald-Wolfowitz test relative to
Student's t are, of course, not applicable here. The formulae for Pr(U),
given under Rationale, apply in all of the above cases. One additional
case, in which they do not apply, is that in which the m 1s and n 2s
are arranged around a circle rather than in a straight line. Stevens
(42) has derived the probability for the total number of runs in this
case.


h. Tables. Probabilities for U have been tabled by Swed
and Eisenhart (44) for m ≤ n ≤ 20, and for certain other cases. Major
portions of their tables are republished in (1-8, 1-23, 1-43);
smaller portions can be found in (22, 23, 50). David (9) has
provided tables appropriate when m+n < 14 and 2 < U < 14.

The mean and variance of U are

$$\frac{2mn}{m+n} + 1 \quad \text{and} \quad \frac{2mn\,(2mn - m - n)}{(m+n)^2\,(m+n-1)}$$

respectively and U is asymptotically normally distributed if the
ratio of sample sizes remains constant while sample sizes approach
infinity (48). Therefore, when samples are too large for the tables
to apply, approximate probabilities can be obtained by treating U as
a normal deviate and referring the critical ratio

$$\frac{U - \dfrac{2mn}{m+n} - 1}{\sqrt{\dfrac{2mn\,(2mn - m - n)}{(m+n)^2\,(m+n-1)}}}$$

to normal tables. (To correct for continuity, reduce the absolute
value of the numerator by 1/2.) Generally the test will be one-tailed
with "too few" runs constituting the critical region, in which case, of
course, a one-tailed probability must be read from the normal tables
for the critical ratio.
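
The large-sample procedure just described can be sketched as follows (Python, not part of the original report). The function names are illustrative, and the standard normal distribution function is obtained from `math.erf` rather than from printed tables.

```python
from math import erf, sqrt


def runs_mean_var(m, n):
    """Mean and variance of the total number of runs U."""
    mean = 2 * m * n / (m + n) + 1
    var = 2 * m * n * (2 * m * n - m - n) / ((m + n) ** 2 * (m + n - 1))
    return mean, var


def lower_tail_normal(u, m, n, continuity=True):
    """Approximate Pr(U <= u) by referring the critical ratio to the normal."""
    mean, var = runs_mean_var(m, n)
    deviation = abs(u - mean)
    if continuity:
        # reduce the absolute value of the numerator by 1/2
        deviation = max(deviation - 0.5, 0.0)
    z = deviation / sqrt(var)
    if u < mean:
        z = -z
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

The approximation is intended for samples beyond the range of the tables; for small samples such as m = 7, n = 5 the exact probabilities remain preferable.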

i. Sources. 9, 10, 13, 14, 15, 22, 23, 29, 30, 33, 34,
35, 42, 44, 47, 48, 49, 50, 51, 52.


3. Length of the Longest Run

a. Rationale. Just as the total number of runs is an index
of a possible tendency for like objects to cluster, so is the length of
the longest run. Using the notation of Section 1, the probability
that the longest run of 1s will be of length S can be obtained by taking
$n_1$ and $n_2$ as fixed and summing the formula

$$\Pr(r_{1j}) = \frac{\dfrac{r_1!}{r_{11}!\, r_{12}! \cdots r_{1(S-1)}!\, r_{1S}!} \dbinom{n_2 + 1}{r_1}}{\dfrac{n!}{n_1!\, n_2!}}$$

over all values of $r_1$ and over all sets of $r_{11}, r_{12}, \ldots,
r_{1(S-1)}, r_{1S}$ which satisfy

$$\sum_{j=1}^{S} j\, r_{1j} = n_1, \quad \sum_{j=1}^{S} r_{1j} = r_1, \quad \text{and} \quad r_{1S} \geq 1$$

and such that $r_1$ exceeds neither $n_1 - S + 1$ nor $n_2 + 1$. The
probability that the longest run of either 1s or 2s will be of length S
can be obtained by an analogous attack upon the formula

$$\Pr(r_{ij}) = \frac{\dfrac{r_1!}{r_{11}!\, r_{12}! \cdots r_{1S}!} \cdot \dfrac{r_2!}{r_{21}!\, r_{22}! \cdots r_{2S}!} \cdot F(r_1, r_2)}{\dfrac{n!}{n_1!\, n_2!}}$$

with the proviso that $r_{1S}$ and $r_{2S}$ cannot both be zero at the same
time. The above method is involved and considerably more
convenient formulae have been derived for such probabilities (1, 34,
38, 49); however, their derivation is not as simple as those which
have been presented here.


b. Null Hypothesis. Given that a sample contains $n_1$ 1s
and $n_2$ 2s, each of the $\binom{n_1 + n_2}{n_1}$ distinguishable
arrangements of 1s and 2s was equally likely to have been obtained
prior to sampling.


c. Assumptions. Sampling is random, observations are
independent, and all observations can be unmistakably classified
into one of two mutually exclusive and unconfusable categories.


d. Treatment of Ties. Ties are a problem only when
their resolution may change the length of the longest run. Such
ties should be resolved in the manner least conducive to rejection
of the null hypothesis or else dealt with in a manner analogous to
that outlined for critical ties in Section 2, The Wald-Wolfowitz
Total Number of Runs Test.
e. Efficiency. Power functions were obtained by Bateman
(1) for the length of longest run and for the total number of runs as
tests of randomness against the alternative of a simple Markov chain,
i.e., that each event is dependent upon the preceding event but no
other. For this case the length of longest run test was found to be
less powerful than the total number of runs test.

f. Application. In the following series

a a b b a a a a b b b b b b b a b a a a

the longest run contains 7 like objects. Referring to tables of
probabilities with $n_1$ = 10, $n_2$ = 10, and longest run = 7, the
probability that at least one run of length 7 or more will
occur either among the a's or among the b's is found to be .032. The
probability that at least one run of 7 or more b's will occur is .017.
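
The two tabled values can be checked by counting arrangements in which no run exceeds a given length. The following Python sketch is not part of the original report and is not the method used to construct the cited tables; it counts the ways of splitting each kind of object into runs of bounded length and fits the two sets of runs together using F. Function names are illustrative.

```python
from fractions import Fraction
from math import comb


def bounded_compositions(total, parts, cap):
    """Ways to split `total` like objects into `parts` runs of length 1..cap."""
    # dp[t] = number of ways to use up t objects with the runs placed so far
    dp = [1] + [0] * total
    for _ in range(parts):
        new = [0] * (total + 1)
        for t, ways in enumerate(dp):
            if ways:
                for length in range(1, min(cap, total - t) + 1):
                    new[t + length] += ways
        dp = new
    return dp[total]


def arrangements_with_caps(n1, n2, cap1, cap2):
    """Arrangements with no run of 1s over cap1 and no run of 2s over cap2."""
    count = 0
    for r1 in range(1, n1 + 1):
        for r2 in (r1 - 1, r1, r1 + 1):
            if 1 <= r2 <= n2:
                f = 2 if r1 == r2 else 1
                count += (bounded_compositions(n1, r1, cap1)
                          * bounded_compositions(n2, r2, cap2) * f)
    return count


def pr_longest_at_least(s, n1, n2, either=True):
    """Pr(longest run >= s) among either kind, or among the 2s only."""
    cap1 = s - 1 if either else n1
    ok = arrangements_with_caps(n1, n2, cap1, s - 1)
    return 1 - Fraction(ok, comb(n1 + n2, n1))
```

With ten objects of each kind, `pr_longest_at_least(7, 10, 10)` is 5832/184756, about .032, and `pr_longest_at_least(7, 10, 10, either=False)` is 3146/184756, about .017, in agreement with the values quoted in the application.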

g. Discussion. See 2, The Wald-Wolfowitz Total Number of
Runs Test.

h. Tables. Bateman (1) has provided probability tables
for "at least one greatest run, of either kind of element, of given
length" for values of $n_1 + n_2 \leq 20$. These are point probabilities,
i.e., are for one or more runs exactly S in length. Mosteller (38)
has tabled the probability of at least one run of length S or greater
among elements of one type, either type or each type for
$n_1 = n_2$ = 5, 10, 15, 20 or 25. Portions of Mosteller's tables have
been republished by (50, 1-15).

i. Sources. 1, 34, 38, 49, 50.


4. Length of Longest Run as a Test for Randomness against
Trend Alternatives


Suppose that a series of observations have been taken upon
a continuously distributed variable and that they have been arranged
in the order in which they were drawn, no two observations having
been drawn simultaneously. If each observation is now labeled A
or B depending upon whether it is above or below the median for the
entire series, the presence or absence of trend can be tested by
using as the test statistic one of the following: the length of the
longest run on one side, either side or both sides of the median.
If there are an odd number of observations one of them will be the
median and it should be discarded. This test has been proposed
by Mosteller (38) who has published appropriate tables for the cases
where $n_1 = n_2$ = 10, 15, 20 or 25. See also (50, 1-15).


5.  Length  of  Longest  Run  in  Binomial  Trials 

a.  Rationale.  Rationale  of  3,  Length  of  Longest  Run, 
discussed  the  method  of  obtaining  the  formula  for  the  probability 
that  the  longest  run  of  Is  will  be  of  exactly  length  S.  This  prob¬ 
ability  was  obtained  by  taking  n  and  n^  as  fixed  constants,  and 
is  contingent  upon  their  having  the  values  assigned  them.  Let  Pr 
(S  I  n^)  stand  for  such  a  probability,  and  let  n  =  n  +  n  be  fixed. 

Now  suppose  that  the  occurrence  of  a  1  or  a  2  is  ^  binc^nial  event 
with  probability  p  or  q  respectively  for  a  single  trial.  If,  for  every 

possible  value  of  n  ,  Pr  (S|n  )  is  multiplied  by  (^)p^l  q^2  and  the 

1  1  ^1 

products  are  summed,  the  sum  is  simply  the  a  priori  probability 
that  in  n  binomial  trials  the  longest  run  of  consecutive  Is  will  be 
of  exactly  length  S.  More  convenient  methods  and  formulae  are  used 
in  actual  tabulation  of  probabilities  (21,  34,  46,  49). 


b. Null Hypothesis. The probability that in n trials there
will be exactly $n_1$ 1s is $\binom{n}{n_1} p^{n_1} q^{n_2}$, and for any
obtained value of $n_1$ each of the $\binom{n}{n_1}$ distinguishable
arrangements of 1s and 2s is equally probable.


c. Assumptions. Sampling is random; observations are
independent; 1 and 2 are mutually exclusive outcomes of a binomial
event with constant probabilities p and q = 1 - p for a single trial.

d. Treatment of Ties. Break ties in the manner least
conducive to rejection of the null hypothesis.

e. Efficiency. No information available.

f. Application. An experimenter wishes to test whether
or not a monkey can learn to associate a red light with food. The
monkey's food is always hidden in one of five boxes and the "reward"
box is always illuminated by a red light. The probability
of "success" on a single trial is therefore 1/5 if the null hypothesis
of no learning is true. Consulting Grant's tables (20)
the experimenter finds that when p = 1/5 a run of 4 or more successes
in 40 trials is significant at the .05 level. Therefore he
decides to run not more than 40 trials and to reject the null hypothesis
whenever the number of consecutive successes reaches 4.

The monkey's successes and failures to go first to the red-illuminated
box are: FFFSFFFFFSSFSSSS, so only 16 of the
maximum of 40 trials had to be run. The significance level, however,
is not reduced but remains .05 since it had originally been
intended to run as many as 40 trials if necessary.
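
The stopping rule used in this application can be sketched as follows. The function name and the coding of outcomes as the letters S and F are assumptions made for illustration; the criterion of 4 consecutive successes and the ceiling of 40 trials are taken from the example.

```python
def trials_until_run(outcomes, run_length, max_trials):
    """Return the trial on which `run_length` consecutive successes is first
    reached, or None if it is not reached within `max_trials` trials."""
    streak = 0
    for trial, outcome in enumerate(outcomes[:max_trials], start=1):
        streak = streak + 1 if outcome == "S" else 0
        if streak == run_length:
            return trial  # reject the null hypothesis and stop testing here
    return None           # criterion never met; the null hypothesis stands


stopping_trial = trials_until_run("FFFSFFFFFSSFSSSS", run_length=4, max_trials=40)
```

For the monkey's record the criterion is met on trial 16, so the remaining 24 permitted trials need not be run.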

g. Discussion. The question arises as to which type of
test is appropriate, that which treats $n_1$ and $n_2$ as given or that
which treats p as given. Mosteller's test for trend takes
$n_1 = n_2 = n/2$, and indeed this must be the case since n continuously
distributed observations are being classified as above or below their
own median. In this case it would be very improper to treat
"above the median" as a binomial event with probability 1/2 since
in n trials of such an event, $n_1$ should be able to assume any value
from zero to n, which is obviously impossible if $n_1$ is the number
of the n observations above the median of the same n observations.
Similarly if one were interested in the randomness of a seating
arrangement, one would take the observed number of occupied and
unoccupied seats as given since it is only the pattern of occupancy,
not the probability of occupancy, in which one is interested.

On the other hand suppose that one knows that he is dealing
with a binomial event (which is free to occur any number of times
from zero to n in n trials) and that one can state, a priori, the exact
value of the constant parameter p. Then by using the "binomial"
approach outlined under Rationale one need only conduct that number
of trials between S and some predetermined value, n, necessary to
produce the criterion of S consecutive successes. Research efficiency
has therefore been gained. Furthermore, when used as a
test for learning, as outlined by Grant (20, 21) and as conducted
under "Application", the "binomial" approach has particularly
desirable features, i.e., the test is particularly sensitive to the
alternative hypothesis. When learning begins p (which is constant
only if the null hypothesis of no learning is true) increases.
This causes $n_1$ to tend to assume a value greater than chance
would have given it. And naturally with a greater number of
successes there are more ways of obtaining a run of S consecutive
successes and the probability of a run of length S increases
simply because of the "inflated" value of $n_1$. Learning, however,
also increases the probability that successful trials will be
temporally adjacent. Therefore, learning makes rejection particularly
likely both by increasing the probability of temporal association
among the successes occurring and by tending to
increase the number of successes beyond what would be expected
if the null hypothesis were true.

h. Tables. Grant (20, 21) has tabled the probability of
a run of at least S successes in n trials for the following values of
p: 1/2, 1/3, 1/4, 1/5. See also (18).

i. Sources. 4, 5, 7, 8, 15, 18, 19, 20, 21, 34, 39, 46, 49.


6.  The  Sum  of  Squared  Run  Lengths 

The Wald-Wolfowitz total number of runs test is one of
the least powerful distribution-free tests for goodness of fit, i.e.,
that two samples were drawn from identical populations. Presumably
this is partly because the total number of runs does not
directly take account of the lengths of runs, which are the more
explicit indices of the tendency of like objects to cluster. The
length of the longest run, by taking account of only the longest run,
ignores the "information" contained in the lengths of the less-than-longest
runs. And in the case investigated by Bateman (1) this
statistic was found to be less powerful than the total number of runs.

Ramachandran and Ranganathan (40) have proposed a test
which overcomes the objections voiced above. Their test statistic,
N, is the sum of the squares of lengths of runs, i.e.,

$$N = \sum_j r_{1j}^2 + \sum_j r_{2j}^2 ,$$

where $r_{1j}$ and $r_{2j}$ are the lengths of the j-th runs of the first and
second kind respectively. Thus all runs are taken account of, but
each run is permitted to influence the test statistic in proportion
to the square of its length. Its authors recommend the test for
the same situation dealt with by Wald and Wolfowitz, i.e., observations
are arranged in increasing order of magnitude and runs of
Sample 1 observations and of Sample 2 observations are noted, the
test being used to decide whether the two samples belong to identical,
continuously distributed, populations. The authors, considering
only the case where $n_1 = n_2$, have tabled values of N required for
various levels of significance. The tabled values of N are exact
for the cases $3 \le n_1 \le 5$ and approximate for $6 \le n_1 \le 15$, in the
latter case having been obtained by reading points from a Type VI
curve fitted to the true distribution of N.
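
As an illustration of how N is formed, the following sketch pools two samples, orders the observations, and sums the squared lengths of the runs of sample labels. The function name is hypothetical, and ties between samples are assumed not to occur, as is implied by continuously distributed populations.

```python
from itertools import groupby


def sum_of_squared_run_lengths(sample_1, sample_2):
    """N = sum of squared lengths of runs of Sample 1 and Sample 2 labels
    after the pooled observations are put in increasing order."""
    labelled = [(x, 1) for x in sample_1] + [(x, 2) for x in sample_2]
    labelled.sort(key=lambda pair: pair[0])
    labels = [label for _, label in labelled]
    return sum(len(list(run)) ** 2 for _, run in groupby(labels))


separated = sum_of_squared_run_lengths([1, 2, 3], [4, 5, 6])    # runs 3, 3
interleaved = sum_of_squared_run_lengths([1, 3, 5], [2, 4, 6])  # six runs of 1
```

Completely separated samples give the largest N for their size and perfectly interleaved samples give the smallest, which is the clustering tendency the statistic is meant to measure.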


7.  Dixon's  Test 


A test analogous to that of Ramachandran and Ranganathan
was proposed earlier by Dixon (12). Two samples of sizes m and n,
with n < m, are drawn from continuously distributed populations and
arranged in order of increasing magnitude irrespective of sample.
There are n + 1 spaces between, before and after the n observations
into which the m observations may be distributed. If the two samples
are from the same population, one would expect the proportion
of the m observations actually falling into a specified space to be
$\frac{1}{n+1}$. Therefore Dixon subtracts the observed proportion
$\frac{m_i}{m}$, where $m_i$ is the number of such observations actually
falling in the i-th space, from the expected proportion $\frac{1}{n+1}$, and
squares the difference. This is done for each value of i from 1 to n+1.
The sum of these n+1 squared differences is taken as the test
statistic and called $C^2$. Probability tables are provided for $C^2$
for cases in which neither m nor n is greater than 10. For larger
values of m or n approximate probabilities can be obtained by a
procedure which relates $C^2$ to the chi square distribution. For
details see (12).


The quantity $m_i$ is of course the length of the run of observations
from the sample of size m which occupies the i-th interval
"between" observations from the other sample of size n. However,
since the i-th interval may be unoccupied, $m_i$ may be zero.
Therefore the quantity squared by Dixon, i.e., $\frac{1}{n+1} - \frac{m_i}{m}$, is not
directly comparable to the quantity squared by Ramachandran and
Ranganathan, i.e., the length of an actually obtained run, which
therefore cannot be zero. Another way of putting it is that while
the value $\frac{1}{n+1}$ is the expected proportion of m-sample observations
falling in the i-th interval, it is not the average length of
obtained, m-sample, runs.

Still another test somewhat similar to the two discussed
above, as well as to the Mann-Whitney test, has been outlined by
Mathen. See (32).
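
A minimal sketch of the computation of Dixon's statistic follows, assuming no ties between the two samples. The function name is hypothetical; the statistic itself is the sum of squared differences between 1/(n+1) and m_i/m described above.

```python
from bisect import bisect_left


def dixon_c_squared(sample_m, sample_n):
    """Sum over the n+1 spaces of (1/(n+1) - m_i/m)**2, where m_i is the number
    of sample_m observations falling in the i-th space defined by sample_n."""
    cut_points = sorted(sample_n)
    counts = [0] * (len(cut_points) + 1)       # one count per space
    for x in sample_m:
        counts[bisect_left(cut_points, x)] += 1
    m = len(sample_m)
    expected = 1.0 / len(counts)               # 1/(n+1)
    return sum((expected - m_i / m) ** 2 for m_i in counts)


evenly_spread = dixon_c_squared([1, 3, 5], [2, 4])  # one observation per space
all_in_one = dixon_c_squared([5, 6, 7], [2, 4])     # all three in the last space
```

The statistic is zero when the m observations are spread evenly over the spaces and grows as they crowd into a few spaces; unoccupied spaces contribute a squared difference of 1/(n+1)^2 each.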


8.  David's  Chi  Square  "Smooth"  Test  of  Goodness  of  Fit 

One  of  the  classic  criticisms  of  the  chi  square  test  of 
goodness  of  fit  is  that,  since  deviations  from  expected  values  are 
squared  before  being  divided  by  the  expected  value  and  summed, 
the  test  does  not  take  account  of  the  directions  of  deviations.  For 
example,  consider  the  following  table  in  which  the  columns,  from 
left  to  right,  represent  the  corresponding,  successive,  abscissa 
intervals. 

f_o          15   14   13   12   11    9    8    7    6    5
f_e          10   10   10   10   10   10   10   10   10   10
f_o - f_e     5    4    3    2    1   -1   -2   -3   -4   -5



If the only restraint is that $\sum f_o = \sum f_e$, then there are 9 degrees
of freedom and the obtained value of 11 for $\chi^2$ has a probability of
about .30. Although there is a strong indication that the left portion
of the true curve lies above, and the right portion lies below,
the hypothesized curve, chi square ignores this information and,
dealing only with the magnitudes of the deviations, falls short of
significance.

David (9) has proposed a test which takes account of both
the magnitude and the direction of the deviations. The test is
generally applicable (for reasons and for exceptions see 9, 11, 17,
41) only when there is a single linear restraint upon chi square,
i.e., when the sum of the expected frequencies has been made to
equal that of the observed, so that the number of degrees of freedom
is one less than the number of deviations. The data are arranged
in a table, similar to the one shown, with each column in the same
relative position as the abscissa interval from which its data were
taken. The chi square test is conducted in the usual way and its
cumulative probability, $P(\chi^2)$, is obtained. Then the total number
of runs of plus and of minus deviations is counted among the deviations
as they are arranged in the table. This number is referred to
a probability table, supplied by David, which gives

$$\Pr(U \le U_0) = \sum_{U=2}^{U_0} \Pr(U \mid n_1, n_2),$$

i.e., which gives the probability
of the obtained number of runs cumulated from 2 to the obtained
number and conditional upon the existence of $n_1$ plusses and $n_2$
minusses. (Since $\sum f_e$ has been made to equal $\sum f_o$, $\sum (f_o - f_e) = 0$,
i.e., the sum of the deviations must equal zero, and there must
be at least one positive and one negative deviation. Therefore,
since one run is impossible, the cumulation starts with two. However
this qualification is automatically imposed whenever both $n_1$
and $n_2$ are different from zero, so any set of total-number-of-runs
tables is appropriate if entered with the obtained values of
$n_1$ and $n_2$, neither of which can, in this application, equal zero.)
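
The two ingredients of David's test can be sketched for the table given earlier in this section. The function names are hypothetical, the run probabilities use the standard exact distribution of the total number of runs for fixed n_1 and n_2 (assumed here to be what any total-number-of-runs table contains), and zero deviations are simply dropped, which is an assumption rather than a rule taken from David.

```python
from math import comb


def chi_square(observed, expected):
    """Usual chi square: sum of (f_o - f_e)**2 / f_e over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def count_sign_runs(deviations):
    """Total number of runs of plus and of minus deviations, in table order."""
    signs = [d > 0 for d in deviations if d != 0]  # assumption: drop zeros
    return 1 + sum(a != b for a, b in zip(signs, signs[1:]))


def prob_runs_exactly(u, n1, n2):
    """Pr(U = u | n1, n2) for n1 plusses and n2 minusses in random order."""
    k, odd = divmod(u, 2)
    if k < 1:
        return 0.0
    if odd:
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    else:
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    return ways / comb(n1 + n2, n1)


def prob_runs_at_most(u0, n1, n2):
    """Pr(U <= u0 | n1, n2), cumulated from U = 2."""
    return sum(prob_runs_exactly(u, n1, n2) for u in range(2, u0 + 1))


f_o = [15, 14, 13, 12, 11, 9, 8, 7, 6, 5]
f_e = [10] * 10
deviations = [o - e for o, e in zip(f_o, f_e)]
x2 = chi_square(f_o, f_e)
runs = count_sign_runs(deviations)
p_runs = prob_runs_at_most(runs, 5, 5)
```

For these data chi square is 11 and there are only 2 runs among 5 plusses and 5 minusses, for which the cumulative run probability is 2/252. The run count thus captures the systematic pattern of signs that chi square alone ignores.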

The chi square test and the total number of runs test are
independent. Therefore a single significance level can be obtained
for the two tests by calculating their joint probability. This is
somewhat complicated by the fact that chi square is a continuously
distributed variable while the distribution of the total number of
runs is discrete. However, David (9) has simplified matters by
tabling this joint probability. Thus one obtains the product of $P(\chi^2)$,
the cumulative probability of the obtained $\chi^2$, and P(U), the
probability of the total number of runs cumulated from U = 2 to the
obtained value. David's tables give the values of this product which
are significant at the .05 and .01 levels of significance for values
of $n_1 + n_2 \le 14$. It is particularly important that expected cell
frequencies should be large enough for the binomial sampling distribution
of "observed" frequencies to be well approximated by a normal
distribution. This is the case because "an assumption implicit in
the test would appear to be that for each $\chi^2$ cell there is an equal
chance of obtaining a positive or a negative deviation". Furthermore,
the independence of the chi square and run tests relates to the
theoretical, continuously distributed chi square distribution, not to
chi square as calculated from the sample. The discrepancy between
the two "chi squares" is negligible when expected cell frequencies
are large, and effective independence can be expected to
obtain; however, there is no certainty that the chi square and run
tests continue to be independent when expected frequencies are small.


9.  Extensions  of  Run  Theory 

Runs discussed so far have involved only two kinds of elements
arranged in a linear sequence. However, various probability
formulae have also been derived for runs of like elements when
there are more than two kinds of elements (34, 43, 49) and for runs
where adjacency among elements can occur along two or more
dimensions (3, 16, 24, 25, 26, 27, 31, 36, 37, 45). Such multiple-category
and polydimensional runs are generally analysed on
the basis of large sample theory, using critical ratios, rather than
exact probabilities, since the exact distribution of such runs rapidly
becomes difficult to tabulate as sample size increases.


CHAPTER  IX 


RUNS  UP  AND  DOWN 


A type of run test for trend can be obtained by defining
a run as an unbroken sequence of increasing or decreasing observations.
In this case the two kinds of events, "greater than
the preceding observation" and "smaller than the preceding
observation," are neither fixed in number nor of constant probability
(since their probabilities depend on how "extreme" was the preceding
observation). Thus the formulae developed in the preceding
chapter are inappropriate. By investigating the probability
for a given pattern of observation magnitudes, rather than
a given pattern of dichotomized "events," the necessary formulae
are obtained. Run tests of this type have used the total number
of runs, the length of the longest run, or chi-square applied to
frequencies of runs of various lengths, as the test statistic.


1.  Introduction 


Suppose that n observations have been taken on a continuously
distributed variable and arranged in the order in which
recorded. A continuously ascending sequence of observations
will be defined as a run "up" and a monotonically decreasing
sequence will be called a run "down". Now suppose that each
observation is subtracted from the succeeding observation.
There will be n-1 algebraic signs to replace the n original
observations. A run "up" will now be more definitively indicated
by a sequence of + signs, and a run "down" will be unambiguously
identified by a run of - signs. The farther an
observation is from the median of the series, the less likely it
will be that the succeeding observation will depart from the median
still farther. Therefore "plus" and "minus" are not constant
probability events and probability formulae for runs up and down
must be derived in the light of that fact.

Consider the probability that the i-th observation obtained
initiates a run up of exactly S+1 observations so that the
difference sign obtained by subtracting the i-th from the (i+1)st
observation is the first + in a sequence of exactly S plusses. A
run up of S+1 observations must begin with the first observation
in the entire series when n = S+1 and it must either begin with the
first or end with the last observation when n = S+2. In order to
examine the general case where the run can initiate, terminate
or lie enclosed within the series, assume that $n \ge S+3$. Consider
first the probability that the series begins with a run up of exactly
S+1 ascending observations. Let the first S+2 observations be
replaced by their ranks, from 1 to S+2, in order of increasing
magnitude. If the series is random, i.e., contains no true trend,
each of the (S+2)! permutations of these S+2 observations is
equally probable. But in order for the series to begin with a
run up of exactly S+1 ascending observations, the S+2 ranks must
be arranged so that: (a) the rank S+2, i.e., the highest among
the S+2 observations, occupies the (S+1)st position, (b) any one of
the remaining S+1 ranks occupies the (S+2)nd position, (c) the
remaining S ranks are arranged in order of increasing size. Of
these three requirements, (a) can be fulfilled in only one way,
(b) can be accomplished in S+1 ways and (c) can then take place
in only one way. So the probability that the series begins with a run
of increasing observations of exactly length S+1 is $\frac{S+1}{(S+2)!}$. This
is also the analogously derived probability that the series ends with
a run up of exactly S+1 ascending observations, i.e., that a run up
of S+1 observations begins with the (n-S)th observation.

Now consider the probability that a run up of S+1 ascending
observations begins at some specified position, i, where
$2 \le i \le n-S-1$, i.e., excluding the cases where the run begins
or ends the series. Let the (i-1)st to the (i+S+1)st observations
be ranked from 1 to S+3 in order of increasing magnitude. If
the series is random, each of the (S+3)! permutations of order
for these S+3 ranks is equally likely. But only in the following
ways can the S+3 ranks be arranged so that the first is higher than
the second, the second to the (S+2)nd form an ascending sequence,
and the (S+3)rd is lower than the (S+2)nd:

(a) Rank 1 occupies the 2nd position, rank S+3 occupies the next to
last position, any one of the remaining S+1 ranks is placed in the
first position, any one of the remaining S ranks is placed in the last
position, and the remaining S-1 ranks are arranged in increasing
order of magnitude from 3rd to second from last position.

(b) Rank 1 occupies the second position, rank S+2 occupies the next
to last position, rank S+3 occupies the first position, any one of the
remaining S ranks is placed in the last position, and the remaining
S-1 ranks are arranged in increasing order of magnitude from 3rd to
second from last position.

(c) Rank 2 occupies the second position, rank S+3 occupies the next
to last position, rank 1 occupies the last position, any one of the
remaining S ranks is placed in the first position, and the remaining
S-1 ranks are arranged in increasing order of magnitude from 3rd
to second from last position.

(d) Rank 2 occupies the second position, rank S+2 occupies the next
to last position, rank S+3 occupies the first position, rank 1 occupies
the last position, and the remaining S-1 ranks are arranged in order
of increasing magnitude from 3rd to second from last position.

There is only one way in which a specified rank can be assigned to a
specified position and only one way in which S-1 ranks can be
arranged in order of increasing magnitude in S-1 positions. Therefore,
the number of ways in which (a), (b), (c), and (d) can be accomplished
is (S+1)S, S, S, and 1 respectively. The probability that a run up of
exactly S+1 ascending observations begins at a predesignated position,
i, when $2 \le i \le n-S-1$, is therefore

$$\frac{S^2+3S+1}{(S+3)!}.$$

We have seen that when $n \ge S+3$, the probability that a
run of exactly S+1 ascending observations begins with the i-th
observation is $\frac{S+1}{(S+2)!}$ when i = 1 or when i = n-S, and is
$\frac{S^2+3S+1}{(S+3)!}$ when i is any one of the n-S-2 values between 2 and
n-S-1. These are
probabilities that the i-th observation initiates a run up of specified
length, i.e., each probability is conditional upon the (i-1)st
observation, if there is one, not being a continuation of the run.
Otherwise viewed, each probability is conditional upon the i-th
observation not being a continuation of any run up which began at some
point earlier in the series. Therefore, since the probabilities
do not refer to overlapping events, they can be summed over all
possible values of i to obtain the expected number of runs of the
specified type. Thus, when $n \ge S+3$ the expected number of runs
of ascending observations of length exactly S+1, or of plus difference
signs of length exactly S, is

$$\frac{2(S+1)}{(S+2)!} + \frac{(n-S-2)(S^2+3S+1)}{(S+3)!},$$

which reduces to

$$\frac{n(S^2+3S+1) - (S^3+3S^2-S-4)}{(S+3)!}.$$

Following analogous derivations, it is clear that when n = S+2 the
expected number of runs up of exactly S+1 observations is
$\frac{2(S+1)}{(S+2)!}$ and when n = S+1 it is $\frac{1}{(S+1)!}$.
(It should be noted that these derivations are based upon
the n observations being in a random order, not upon each difference
sign of a given type being equally likely, which is not the case.)
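
The expected-value formula can be compared with a direct count over all n! orderings for small n. The sketch below uses exact fractions; the function names are hypothetical, and the value used for n = S+2 is the sum of the two end-of-series probabilities derived earlier.

```python
from fractions import Fraction
from itertools import groupby, permutations
from math import factorial


def expected_runs_up_exact(n, S):
    """Expected number of runs of exactly S plus signs among n observations."""
    if n >= S + 3:
        numerator = n * (S * S + 3 * S + 1) - (S ** 3 + 3 * S * S - S - 4)
        return Fraction(numerator, factorial(S + 3))
    if n == S + 2:
        return Fraction(2 * (S + 1), factorial(S + 2))
    if n == S + 1:
        return Fraction(1, factorial(S + 1))
    return Fraction(0)


def enumerated_mean_runs_up_exact(n, S):
    """Average, over all n! orderings, of the number of maximal runs of plus
    difference signs having length exactly S."""
    total = 0
    for perm in permutations(range(n)):
        signs = [b > a for a, b in zip(perm, perm[1:])]
        total += sum(1 for is_plus, run in groupby(signs)
                     if is_plus and len(list(run)) == S)
    return Fraction(total, factorial(n))
```

Enumeration is feasible only for small n (7! = 5040 orderings is already ample for a check), but within that range the two functions should agree exactly.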


The expected number of runs up of ascending observations
of length S+1 or longer is derived in a manner analogous to that
already presented, dropping the restriction that the (S+1)st
observation composing the run be followed by a lower observation. Thus,
assuming $n \ge S+2$, one requires only that when i = 1 the S+1
observations beginning with the i-th are arranged in order of increasing
magnitude and that when $2 \le i \le n-S$, in addition to the above
requirement, the (i-1)st observation is higher than the i-th. The
expected number of runs of ascending observations of length S+1 or
greater is therefore

$$\frac{1}{(S+1)!} + \frac{(n-S-1)(S+1)}{(S+2)!} \quad\text{or}\quad \frac{n(S+1) - (S^2+S-1)}{(S+2)!}$$

when $n \ge S+2$, or $\frac{1}{(S+1)!}$ when n = S+1. And if 1 is substituted
for S in the above formulas, the result is the expected number of
runs of ascending observations of length S+1 = 2 or greater, or the
number of runs of plusses of length 1 or greater. This expected
number is $\frac{2n-1}{6}$ when $n \ge S+2$ and 1/2 when n = S+1.
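
The "length S or greater" formula admits the same kind of check by enumeration. As before this is a sketch with hypothetical function names, practical only for small n.

```python
from fractions import Fraction
from itertools import groupby, permutations
from math import factorial


def expected_runs_up_at_least(n, S):
    """Expected number of runs of S or more plus signs among n observations."""
    if n >= S + 2:
        return Fraction(n * (S + 1) - (S * S + S - 1), factorial(S + 2))
    if n == S + 1:
        return Fraction(1, factorial(S + 1))
    return Fraction(0)


def enumerated_mean_runs_up_at_least(n, S):
    """Average, over all n! orderings, of the number of maximal runs of plus
    difference signs having length S or greater."""
    total = 0
    for perm in permutations(range(n)):
        signs = [b > a for a, b in zip(perm, perm[1:])]
        total += sum(1 for is_plus, run in groupby(signs)
                     if is_plus and len(list(run)) >= S)
    return Fraction(total, factorial(n))
```

With S = 1 the first function returns (2n-1)/6, the expected number of runs of plusses of any length.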

A run up and a run down commencing with the i-th observation
are mutually exclusive events. Therefore to obtain the expected
number of runs up or down, the expected number of runs
up should be doubled. Variances for runs of either plusses or
minuses of length S, or of length S or greater, have been given
by Levene and Wolfowitz (7). The formulae for the general case,
i.e., with S a variable, are lengthy. However, they are greatly
shortened when S is given a specific value. For S = 1,
$\sigma^2 = \frac{305n - 347}{720}$; for S = 2, $\sigma^2 = \frac{51{,}106n - 73{,}859}{453{,}600}$. For $S \ge 1$,
$S \ge 2$, and $S \ge 3$, the respective variances are:
$\frac{16n - 29}{90}$, $\frac{57n - 43}{720}$, $\frac{21{,}496n - 51{,}269}{453{,}600}$.

Consider the n observations ranked from 1 to n in order
of increasing magnitude. There are n! permutations of these ranks,
and the expected number of runs of a specified type is simply the
total number of such runs which can be found in these n! permutations
divided by the number of permutations, n!. On the other
hand, the probability of at least one run of the type specified is
the total number of permutations in which such a run can be found
divided by the number of permutations, n!. Therefore the probability
and expected number do not coincide when it is possible for
more than one run of the specified type to be found in a single
permutation. However, when $S > \frac{n-1}{2}$ the formulae already presented
for the expected number of runs of a given variety also give
the exact probability of occurrence for such runs. This appears
to be the only situation, when dealing with runs up or down, in
which an exact probability can be calculated without resort to a
recursion formula.


2.  Length  of  Longest  Run  Up  or  Down 

Using a recursion formula Olmstead (9) has calculated
and tabled exact probabilities for runs of like difference signs of
length S or greater when $2 \le n \le 14$. For n > 14 Olmstead has
tabled approximate probabilities calculated from asymptotic
formulae (9, 13).


3.  Total  Number  of  Runs  Up  and  Down 


The total number of runs is simply the number of runs
of plusses or minusses of length 1 or greater, and this was shown
in Section 1, Introduction, to have an expected value of $\frac{2n-1}{3}$ and
a variance of $\frac{16n-29}{90}$ when n is greater than 2. The total number
of runs, r, is asymptotically normally distributed (6, 12), so
for large values of n the significance of the total number of runs
can be tested by treating r as a normal deviate and referring the
critical ratio

$$\frac{r - \frac{2n-1}{3}}{\sqrt{\frac{16n-29}{90}}}$$

to normal tables. By reducing the absolute value of the numerator
by 1/2, the critical ratio can be corrected for continuity.
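
The critical ratio can be computed from a recorded series as in the following sketch. The function names are hypothetical, and tied successive observations are rejected outright, an assumption in keeping with the continuously distributed variable required by the test.

```python
from math import sqrt


def count_runs_up_and_down(series):
    """Total number of runs of like difference signs in the series."""
    signs = []
    for previous, current in zip(series, series[1:]):
        if current == previous:
            raise ValueError("tied successive observations are not handled")
        signs.append(current > previous)
    return 1 + sum(a != b for a, b in zip(signs, signs[1:]))


def runs_critical_ratio(series, continuity=True):
    """(r - (2n-1)/3) / sqrt((16n-29)/90), optionally corrected for
    continuity by reducing the absolute value of the numerator by 1/2."""
    n = len(series)
    r = count_runs_up_and_down(series)
    deviation = r - (2 * n - 1) / 3
    if continuity:
        sign = 1.0 if deviation >= 0 else -1.0
        deviation = sign * max(abs(deviation) - 0.5, 0.0)
    return deviation / sqrt((16 * n - 29) / 90)


z_trend = runs_critical_ratio(list(range(20)))  # a steadily rising series
```

A steadily rising series has a single run, far fewer than the 13 expected for n = 20, so its critical ratio is strongly negative. Too few runs point toward trend and too many toward rapid oscillation.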

If the total number of runs is r, then the series has reversed
direction r-1 times, and a test based on the number of
"turning points" is equivalent to one based on the total number of
runs. The expected number of turning points, T, is $\frac{2n-4}{3}$ and
its variance is the same as that for the total number of runs. Therefore
the significance of the number of turning points can be tested by
forming the critical ratio analogous to that given above, referring it
to normal tables. When all tests concerned are applied to samples
from normally distributed populations the turning point test has an
asymptotic relative efficiency of zero with respect to the regression
coefficient test and also with respect to each of eight distribution-free
tests of randomness with which it was compared (10, 11).
See Table I of Introduction.


4.  Chi  Square  Applied  to  Run  Frequencies 


The expected number of runs of plusses or minusses of
exactly length S was derived in Section 1, Introduction, and found
to be $\frac{4(S+1)}{(S+2)!} + \frac{2(n-S-2)(S^2+3S+1)}{(S+3)!}$. The expected total
number of runs of plusses or minusses of all lengths was found to
be $\frac{2n-1}{3}$, the former result requiring that $n \ge S+3$ and the latter
being contingent upon $n \ge S+2$. However, if one regards the first
and last runs as "incompleted" and counts only those runs which
are preceded and followed by at least one run, the term $\frac{4(S+1)}{(S+2)!}$
in the first formula must be dropped since it represents the first
and last runs, and the expected total number of runs must be reduced
by 2. Thus the revised formulae become
$\frac{2(n-S-2)(S^2+3S+1)}{(S+3)!}$ and $\frac{2n-7}{3}$, respectively. Substituting 1
and then 2 for S in the
first revised formula, the expected number of runs of plusses or
minusses of lengths 1 and 2 are found to be $\frac{5(n-3)}{12}$ and $\frac{11(n-4)}{60}$
respectively. Subtracting these two values from the expected total
number of runs one obtains $\frac{4n-21}{60}$, the expected number of runs
of plusses or minusses of length greater than 2.


Wallis and Moore (12, 8) have suggested a chi square test
of significance applied in the usual way to the observed frequencies
of "interior" runs of like signs of lengths 1, 2 and over 2, with the
corresponding expected frequencies being

$$\frac{5(n-3)}{12}, \quad \frac{11(n-4)}{60} \quad \text{and} \quad \frac{4n-21}{60}.$$

There are 2 degrees of freedom, one degree having been expended by
obtaining n from the sample. The test, however, is an approximate
one if the significance of the calculated chi square is obtained
from the usual chi square tables. This is the case because the run
lengths are not entirely independent of one another although the
chi square test assumes that they are. Various empirically obtained
"corrections" are offered by the authors for use when n exceeds 12.
However, for $6 \le n \le 12$ they have provided a table of exact
probabilities for the values of chi square as calculated from the
sample. These were obtained by means of a recursion formula and
give, in effect, that proportion of the $n!$ permutations which
yield a value of chi square as great as or greater than the one
tabled.
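As an illustration only, the expected frequencies above and the
resulting chi square statistic can be sketched in Python. The
function names and the observed counts used in the usage note are
hypothetical, and the sketch does not apply the authors' empirical
corrections or their exact tables.

```python
from fractions import Fraction


def expected_run_counts(n):
    """Expected numbers of interior runs of length 1, 2 and over 2
    for a series of n observations (formulas quoted in the text)."""
    e1 = Fraction(5 * (n - 3), 12)
    e2 = Fraction(11 * (n - 4), 60)
    e3 = Fraction(4 * n - 21, 60)
    return e1, e2, e3


def run_chi_square(observed, n):
    """Ordinary chi square statistic over the three run-length classes.
    `observed` holds the counts of runs of length 1, 2 and over 2."""
    expected = expected_run_counts(n)
    return float(sum((o - e) ** 2 / e for o, e in zip(observed, expected)))
```

For example, `expected_run_counts(24)` gives 35/4, 11/3 and 5/4, which
add up to the expected total (2n-7)/3 = 41/3, and
`run_chi_square((9, 4, 1), 24)` evaluates the statistic for one
made-up set of observed counts.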


The  test  can  be  used  as  a  test  of  randomness  against  either 
trend  or  correlation  alternatives.  In  the  latter  application,  if  an  x 
measurement  and  a  y  measurement  have  been  taken  on  each  of  n 
objects, the objects are arranged in order of increasing magnitude
of one continuously distributed variable and the run test is applied
to measurements on the other variable. The authors point out,
however, that "the conclusion occasionally depends upon which
variate is chosen for arranging in order and which for counting the
phase durations".

CHAPTER X


TESTS  BASED  ON  EXTREME  VALUES 


The number of observations in a second sample which exceed
(or which are exceeded by and therefore included within)
observations of certain size rank in the first sample can be
predicted if the samples are drawn from a common population, or can
be used to test the hypothesis of a common population if the "second
sample" has already been drawn. In either case, the probability is
simply the proportion of all possible arbitrary reassignments of
observations to samples in which the specified number of exceedances
is found to occur. If certain assumptions can be made, the tests for
identical populations become tests for location, dispersion or
extreme reaction. An analogous but different mathematical approach
permits the setting of tolerance limits.

1. Exceedances: Prediction


a. Rationale. Suppose that a sample of n observations
has been taken from a continuously distributed, but otherwise
unknown, population and it is desired to know the probability that
N or more observations in some future sample of size m will exceed
the $r$th observation, in order of increasing magnitude, in the
already obtained sample. For convenience, let the first sample
be designated X's, the second sample, Y's. Since the two samples
are defined to be from the same population, the X's can be considered
as a random sample of n observations "drawn" from the n+m
observations comprising the two samples. Consider the sample of Y's
to have been drawn and the n+m observations in the two samples to
have been arranged in order of increasing magnitude, irrespective
of sample, and labeled Z's with subscripts indicating rank:

$$Z_1, Z_2, \ldots, Z_{r-1+m-N}, Z_{r+m-N}, \ldots, Z_{n+m-1}, Z_{n+m}.$$

Consider now the probability of drawing an "X sample" of n
observations from these Z's so as to leave a remaining "Y sample" of
m observations, N of which exceed the $r$th X in order of magnitude
(and m-N of which are smaller than $X_r$). In order to obtain such a
sample: (a) we must draw $Z_{r+m-N}$, which becomes $X_r$, (b) we
must draw any r-1 of the Z's smaller than $Z_{r+m-N}$, of which
there are r+m-N-1, and (c) we must draw any n-r of the Z's greater
than $Z_{r+m-N}$, of which there are m+n-(r+m-N) or n-r+N. There is
only one way of doing (a), but there are $\binom{r+m-N-1}{r-1}$ ways
of accomplishing (b) and $\binom{n-r+N}{n-r}$ ways of fulfilling
requirement (c). Therefore there are
$\binom{r+m-N-1}{r-1}\binom{n-r+N}{n-r}$ ways of performing the
entire operation. Since there are $\binom{n+m}{n}$ ways of drawing
the X sample, without these restrictions as to position, the
probability of drawing an X sample which will leave N of the
remaining observations greater than $X_r$ is

$$\frac{\binom{r-1+m-N}{r-1}\binom{n-r+N}{n-r}}{\binom{n+m}{n}},$$

and this is the probability that in a future sample of m
observations from the same population, exactly N observations will
exceed $X_r$. The probability of at least N exceedances is

$$P(\text{Exceedances} \ge N) = \sum_{i=N}^{m} \frac{\binom{r-1+m-i}{r-1}\binom{n-r+i}{n-r}}{\binom{n+m}{n}}.$$

b. Assumptions. Random sampling and no tied observations.
The latter assumption is met, in theory, if the population
is continuously distributed and measurements are precise.


c. Treatment of Ties. Tied observations, if their proportion
is small, become a practical problem only if $X_r$ is tied
with other observations. Since Y observations are hypothetical,
none will be tied with $X_r$. If $X_r$ is tied with other X's,
calculate the exceedance probability as many times as there are X's
tied with $X_r$, each time letting r be a different one of the ranks
the members of the tied group would have if not tied. For a
conservative estimate, use the smallest or largest of these,
whichever results in the greater conservatism, as the probability of
N exceedances. If it is desired to minimize the error, use the
average of the separately calculated probabilities.


d. Application. Obtain a sample of n observations from
the population in question and rank them from smallest, 1, to
largest, n. Letting subscripts indicate rank, the ordered
observations will be: $X_1, X_2, \ldots, X_r, \ldots, X_{n-1}, X_n$.
Treating ties as outlined above, the probability that of m future
observations at least N of them will be larger than $X_r$, the
magnitude of the obtained observation whose rank is r, is given by
the last formula in "Rationale".
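The calculation described above can be sketched in a few lines of
Python. This is a minimal illustration under the stated assumptions
(random sampling, no ties); the function names are hypothetical and
are not taken from any of the cited sources.

```python
from math import comb


def prob_exactly(n, m, r, N):
    """Probability that exactly N of m future observations exceed the
    r-th smallest of n obtained observations."""
    return comb(r - 1 + m - N, r - 1) * comb(n - r + N, n - r) / comb(n + m, n)


def prob_at_least(n, m, r, N):
    """Probability of at least N exceedances: the point probabilities
    summed from i = N to i = m."""
    return sum(prob_exactly(n, m, r, i) for i in range(N, m + 1))
```

As a check on the sketch, with r = 1 and N = m the formula reduces to
n/(n+m), the chance that the smallest of all n+m observations is one
of the X's; `prob_at_least(10, 5, 1, 5)` therefore equals 10/15.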


e. Discussion. The formula given for the probability of
at least N exceedances over $X_r$, of course, also gives the
probability that m-N or fewer future observations will be less than
$X_r$. The formula applies to exceedances over the $r$th smallest
observation or, since the $r$th rank from the bottom is the
$(n-r+1)$th rank from the top, to the $(n-r+1)$th largest
observation.

The point probability for exceedances can be evaluated by
use of binomial tables. Let the exceedance probability formula,

$$\frac{\binom{r-1+m-N}{r-1}\binom{n-r+N}{n-r}}{\binom{n+m}{n}},$$

be multiplied by $\frac{p^n q^m}{p^n q^m}$ or, equivalently, by

$$\frac{p^{r-1} q^{m-N}\, p^{n-r} q^{N}\, p}{p^n q^m}.$$

The formula thus becomes

$$\frac{p \left[\binom{r-1+m-N}{r-1} p^{r-1} q^{m-N}\right] \left[\binom{n-r+N}{n-r} p^{n-r} q^{N}\right]}{\left[\binom{n+m}{n} p^{n} q^{m}\right]}.$$

Each of the expressions in brackets is a binomial probability and
can be read directly from binomial tables. The values p and q=1-p
can be selected arbitrarily by the experimenter, so long as all
p's are taken to be the same exactly tabled value. For convenience
take p=q=1/2. Then

$$P(\text{Exceedances} \ge N) = \sum_{i=N}^{m} \frac{(1/2)\left[\binom{r-1+m-i}{r-1}(1/2)^{r-1+m-i}\right]\left[\binom{n-r+i}{n-r}(1/2)^{n-r+i}\right]}{\left[\binom{n+m}{n}(1/2)^{n+m}\right]}$$

$$= (1/2) \sum_{i=N}^{m} \frac{\text{Point Bin. Pr.}(.50,\ r-1+m-i,\ r-1)\ \cdot\ \text{Point Bin. Pr.}(.50,\ n-r+i,\ n-r)}{\text{Point Bin. Pr.}(.50,\ n+m,\ n)}.$$
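The equivalence between the combinatorial formula and its
binomial-table form can be checked numerically. In the sketch below
the three arguments of "Point Bin. Pr." are read as (p, number of
trials, number of successes); that reading, like the function names,
is an assumption made for illustration.

```python
from math import comb


def point_binomial(p, trials, successes):
    """Point binomial probability, as it would be read from a table."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)


def exceedances_direct(n, m, r, N):
    """P(at least N exceedances) from the combinatorial formula."""
    total = sum(
        comb(r - 1 + m - i, r - 1) * comb(n - r + i, n - r)
        for i in range(N, m + 1)
    )
    return total / comb(n + m, n)


def exceedances_from_tables(n, m, r, N, p=0.5):
    """The same probability assembled from point binomial terms."""
    denominator = point_binomial(p, n + m, n)
    numerator = sum(
        point_binomial(p, r - 1 + m - i, r - 1) * point_binomial(p, n - r + i, n - r)
        for i in range(N, m + 1)
    )
    return p * numerator / denominator
```

Because the powers of p and q cancel, the two functions should agree
for any p strictly between 0 and 1, not only for p = 1/2.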


f. Tables. Wilks (37) has published a short table of
probabilities for exceedances over the smallest value of an obtained
sample, i.e. r = 1. Epstein (4) has tabled exceedance probabilities
for the case where the future sample is to be equal in size to the
obtained sample, i.e. m = n. Rosenbaum (21) has tabled probabilities
for exceedances over the largest value of an obtained sample, i.e.
r = n. Gumbel and von Schelling (10) have graphed the probability of
one or two exceedances over the largest or near-to-the-largest X
value. Tables and graphs are also to be found in (11) and (13). Also
see "Discussion" section for techniques of using binomial tables to
obtain exceedance probabilities.

g. Sources. (4, 10, 11, 13, 21, 36, 37)


2. Exceedances: Tests of Hypotheses

In the preceding section the probability was determined for
at least N exceedances in a second, future, sample from the same
population. If the second sample has actually been obtained, this
same probability, derived in the same way, is the a priori
probability for the obtained results under the null hypothesis that
the two samples come from identical populations.

Exceedances therefore can be used to test the null hypothesis
that two samples come from the same population under the assumption
that the population is continuously distributed. In order to be able
properly to use exceedance probability tables to test this
hypothesis, $X_r$ must be designated in advance of sampling, i.e.
both the rank, r, and the sample, whether X or Y, determining the
"reference point" for exceedances must be selected in advance.

Rosenbaum (21), Epstein (4), and Mathisen (13) have all
suggested such tests. Rosenbaum uses exceedances over the
largest X observation as his test statistic and has provided tables
for it. Epstein uses exceedances over $X_r$, with r allowed to
assume any preassigned value, but with the restriction that the
two samples be of equal size. Tables are provided. Mathisen
takes for $X_r$ the median of an X sample containing an odd number
of observations and provides a small table of probabilities for the
number of observations in a second sample which will be lower
than the median of the first. All three tests are, in effect, based
on the premise that if the two samples are from identical
populations the expected proportion of each sample above some
arbitrarily designated value, $X_r$, should be the same. However,
while identical populations insure that the proportion of each
population above $X_r$ is the same, the reverse is not true. The two
populations can assume widely differing forms and, so long as their
cumulative distribution functions are equal at the point $X_r$, the
null hypothesis will not be rejected more than $\alpha$ of the time.
The above tests are therefore not consistent except for such classes
of alternatives as slippage, i.e., f(y) = f(x+c) with c a constant
(2).

If other X observations are tied with $X_r$, they should be
treated as outlined in the preceding section. Y observations tied
with $X_r$ can be "assigned" positions. For a conservative test,
count all Y observations tied with $X_r$ as falling on whichever
side of $X_r$ will be least conducive to rejection of the null
hypothesis. To minimize error, half may be assigned above, half
below $X_r$, an odd tied observation being treated "conservatively".

For a test at significance level $\alpha$, reject if

$$\sum_{i=N}^{m} \frac{\binom{r-1+m-i}{r-1}\binom{n-r+i}{n-r}}{\binom{n+m}{n}} < \alpha$$

for a one-tailed test against the alternative hypothesis of
excessive exceedances, i.e., that the proportion of values in the Y
population which are greater than $X_r$ exceeds the proportion of
the X population which is greater than $X_r$. For a two-tailed test
in which the alternative hypothesis is "either too many or too few
exceedances", reject the null hypothesis if either the above
summation or the summation taken from i = 0 to i = N is less than
$\frac{\alpha}{2}$.
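The rejection rules above can be sketched as follows. The helper
names are hypothetical, and the strict inequalities simply follow
the wording of the text; the final helper, which finds the smallest
number of exceedances that leads to one-tailed rejection, is an
illustrative addition rather than a procedure given in the sources.

```python
from math import comb


def exceedance_pmf(n, m, r, i):
    """Null probability of exactly i exceedances over the r-th smallest X."""
    return comb(r - 1 + m - i, r - 1) * comb(n - r + i, n - r) / comb(n + m, n)


def upper_tail(n, m, r, N):
    """Summation from i = N to i = m (N or more exceedances)."""
    return sum(exceedance_pmf(n, m, r, i) for i in range(N, m + 1))


def lower_tail(n, m, r, N):
    """Summation from i = 0 to i = N (N or fewer exceedances)."""
    return sum(exceedance_pmf(n, m, r, i) for i in range(0, N + 1))


def reject_two_tailed(n, m, r, N, alpha):
    """Two-tailed rule: reject if either tail is less than alpha / 2."""
    return upper_tail(n, m, r, N) < alpha / 2 or lower_tail(n, m, r, N) < alpha / 2


def smallest_rejecting_count(n, m, r, alpha):
    """Smallest N whose upper-tail probability is below alpha, or None."""
    for N in range(m + 1):
        if upper_tail(n, m, r, N) < alpha:
            return N
    return None
```

With n = m = 10 and r = 10 (exceedances over the largest X), the
one-tailed test at the .05 level rejects as soon as 4 of the Y's
exceed the largest X, since the upper-tail probability for 4
exceedances is 8008/184756, about .043.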


This type of significance test is particularly useful when
experimentation is costly in terms of time or material. All m
of the Y observations need not necessarily be taken, since the
null hypothesis can be rejected whenever the number of exceedances
among the Y's reaches a certain value (determined by n, m, r and
$\alpha$). The test is especially appropriate for life testing since
the experiment need last only long enough to identify $X_r$ and for
the number of exceedances to reach the rejection criterion (3).

3. Includances: Prediction


a. Rationale. Let the first sample from a continuously
distributed but otherwise unknown population be arranged in order
of ascending magnitude as follows:
$X_1, X_2, \ldots, X_r, \ldots, X_s, \ldots, X_{n-1}, X_n$
(r and s being ranks which can be assigned any integral
value from 1 to n so long as s is greater than r). The following
derivation will obtain the probability that N observations, in a
future sample of m observations, designated as Y's, will lie
within the range of magnitudes whose endpoints are $X_r$ and $X_s$.

Consider the second sample to have been drawn and let the n+m
observations be arranged in order of increasing magnitude,
irrespective of sample, and labeled Z's with subscripts indicating
rank: $Z_1, Z_2, \ldots, Z_{r+L}, \ldots, Z_{s+L+N}, \ldots, Z_{n+m}$.
The a priori probability that exactly L of the Y observations are
smaller than $X_r$, N are between $X_r$ and $X_s$, and m-L-N above
$X_s$ is the probability of drawing the X sample so as to consist of
r-1 of the Z's below $Z_{r+L}$, $Z_{r+L}$ itself, s-1-r of the Z's
between $Z_{r+L}$ and $Z_{s+L+N}$, $Z_{s+L+N}$ itself, and n-s of
the Z's above $Z_{s+L+N}$. This probability is

$$\frac{\binom{r-1+L}{r-1}\binom{1}{1}\binom{s-1-r+N}{s-1-r}\binom{1}{1}\binom{n+m-s-L-N}{n-s}}{\binom{n+m}{n}}.$$

This probability contains the restriction that exactly L of the
m-N Y's outside of the $X_r$ to $X_s$ range shall fall below $X_r$.
In order to remove this undesired restriction, the probability must
be summed over all of the values from 0 to m-N which L can assume
without changing N. The probability that exactly N of the Y's will
fall between $X_r$ and $X_s$ is therefore

$$\sum_{L=0}^{m-N} \frac{\binom{r-1+L}{r-1}\binom{s-1-r+N}{s-1-r}\binom{n+m-s-L-N}{n-s}}{\binom{n+m}{n}},$$

and the probability that N or more Y's will fall between $X_r$ and
$X_s$ is

$$\sum_{i=N}^{m} \sum_{L=0}^{m-i} \frac{\binom{r-1+L}{r-1}\binom{s-1-r+i}{s-1-r}\binom{n-s+m-L-i}{n-s}}{\binom{n+m}{n}}.$$

b. Assumptions. See 1. Exceedances: Prediction.

c. Treatment of Ties. See 1. Observations tied with $X_s$
should be treated separately but in the same way as observations
tied with $X_r$.

d. Application. Last formula in "Rationale" gives required
probability.
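A minimal Python sketch of the includance probabilities is given
below. It assumes 1 <= r < s <= n and no ties; the function names are
hypothetical.

```python
from math import comb


def includance_exactly(n, m, r, s, N):
    """Probability that exactly N of m future Y's fall between the
    r-th and s-th smallest of n X's (summing over L = 0, ..., m - N)."""
    inner = sum(
        comb(r - 1 + L, r - 1) * comb(n + m - s - L - N, n - s)
        for L in range(m - N + 1)
    )
    return comb(s - 1 - r + N, s - 1 - r) * inner / comb(n + m, n)


def includance_at_least(n, m, r, s, N):
    """Probability that N or more Y's fall between X_r and X_s."""
    return sum(includance_exactly(n, m, r, s, i) for i in range(N, m + 1))
```

As a small check, with n = 2, m = 1, r = 1 and s = 2 the probability
that the single Y falls between the two X's is 1/3, as it must be
when each of the three observations is equally likely to be the
middle one.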


e. Discussion. The probability that N or more Y's will
fall between $X_r$ and $X_s$ is, of course, also the probability
that m-N or less of the Y's will fall outside the interval bounded
by $X_r$ and $X_s$.

This probability can be expressed in terms of several binomial
probabilities. It becomes

$$\sum_{i=N}^{m} \sum_{L=0}^{m-i} \frac{p^2 \left[\binom{r-1+L}{r-1} p^{r-1} q^{L}\right] \left[\binom{s-1-r+i}{s-1-r} p^{s-1-r} q^{i}\right] \left[\binom{n-s+m-L-i}{n-s} p^{n-s} q^{m-L-i}\right]}{\left[\binom{n+m}{n} p^{n} q^{m}\right]},$$

each of the bracketed expressions being obtainable from tables of the
point binomial. The parameters p and q are chosen arbitrarily. For
p = q = 1/2, the double summation becomes

$$\sum_{i=N}^{m} \sum_{L=0}^{m-i} \frac{(1/2)^2 \left[\binom{r-1+L}{r-1} (1/2)^{r-1+L}\right] \left[\binom{s-1-r+i}{s-1-r} (1/2)^{s-1-r+i}\right] \left[\binom{n-s+m-L-i}{n-s} (1/2)^{n-s+m-L-i}\right]}{\left[\binom{n+m}{n} (1/2)^{n+m}\right]}.$$

Even with the help of binomial tables this probability is not quickly
evaluated. By careful choice of the parameters n, m, r, s, N, L,
the formula can be considerably simplified. Without such
simplification the method will prove practical only when n and m are
quite small or when tables of probabilities for N are available.




f. Tables. Wilks (37) has published a small table for the
case where r = 1, s = n. Rosenbaum (20) has produced an extensive
table for the same case, but in terms of the probability for m-N
Y values outside of the interval $X_1$ to $X_n$. Moses (14) has also
published a small table for certain cases where s = n-r+1. Binomial
tables can also assist in evaluating probabilities. See (e).

g. Sources. (14, 20, 36, 37)


4.  Includances:  Tests  of  Hypotheses 

If the second sample has actually been obtained, the
probability derived in the preceding section can be used to test the
null hypothesis that the two samples are from the same continuously
distributed population. The values n, m, r, s and $\alpha$ must, of
course, be selected in advance of sampling, which must be random.
Rosenbaum (20) proposes includances as a test of equal dispersions
for two populations known to have the same median. He has provided
extensive probability tables for the number of Y's which fall
outside of the interval whose endpoints are $X_1$ and $X_n$. If
medians are not known to be equal, his test becomes a test for
identical populations. Moses (14) uses includances to test the null
hypothesis that an experimental and a control group belong to the
same population against the alternative hypothesis that the
treatment to which the experimental group is subjected tends to
increase the scores of some individuals and reduce those of others
("defensive responses"). Moses takes as his test statistic the
number of X's equal to or included between $X_r$ and $X_{n-r+1}$,
plus the number of Y's included between these endpoints. Since the
number of X's in this interval is predetermined, the probability for
the obtained statistic is the same as the probability for the number
of Y includances. A small table of probabilities is given.

X scores tied with $X_r$ and X scores tied with $X_s$ should be
dealt with separately but in the same way as outlined under
1. Exceedances: Prediction, for observations tied with $X_r$.
Y scores tied with $X_r$ and Y scores tied with $X_s$ should, for a
conservative test, all be counted as falling within or outside the
$X_r$ to $X_s$ interval, whichever is least conducive to rejection
of the null hypothesis. If it is desired to minimize the error, half
of the Y's tied with $X_r$ should be counted as falling inside the
interval, half outside, and likewise for Y's tied with $X_s$, odd
tied Y's being dealt with conservatively.

For a one-tailed test of the null hypothesis of identical
populations against the alternative of excessive includances, reject
at the level $\alpha$ if

$$\sum_{i=N}^{m} \sum_{L=0}^{m-i} \frac{\binom{r-1+L}{r-1}\binom{s-1-r+i}{s-1-r}\binom{n-s+m-L-i}{n-s}}{\binom{n+m}{n}} \le \alpha.$$

If the alternative is too few includances, reject at the level
$\alpha$ if the double summation equals or exceeds $1-\alpha$. For a
two-tailed test, reject at the level $\alpha$ if the double
summation is $\le \alpha/2$ or $\ge 1-\alpha/2$. The above formula
is valid for the desired probability only if previous to sampling it
is specified which sample is to be the X sample and which the Y
sample. The values of r, s, n, m, and $\alpha$ must also be decided
upon before the samples are obtained.


If r and s are taken to be 1 and n respectively so that the
interval is that included between the smallest and largest X
observations, the probability is greatly simplified. The first and
last combinatorial expressions in the numerator become 1. Summing
from L = 0 to L = m-i, therefore, amounts simply to multiplying

$$\frac{\binom{s-1-r+i}{s-1-r}}{\binom{n+m}{n}} \quad \text{or} \quad \frac{\binom{n-2+i}{n-2}}{\binom{n+m}{n}}$$

by m-i+1. The probability that N or more of m Y observations will
fall within the endpoints of a sample of n X observations from the
same continuously distributed population thus becomes

$$\sum_{i=N}^{m} \frac{\binom{n-2+i}{n-2}(m-i+1)}{\binom{n+m}{n}} = \frac{m!\,n(n-1)}{(n+m)!} \sum_{i=N}^{m} \frac{(n-2+i)!\,(m-i+1)}{i!}.$$

5. A Univariate Tolerance Limit


a. Rationale. While confidence limits specify a region
within which a population parameter is inferred to lie, tolerance
limits enclose a region within which a specified proportion of the
entire population is inferred to exist.

Let a sample of n observations be taken from a continuously
distributed population, f(x), and arranged in order of increasing
magnitude with subscripts indicating rank in that order. The
proportion of the unknown parent population which is smaller than
$X_r$, the $r$th smallest sample observation, is
$\int_{-\infty}^{x_r} f(x)\,dx$ or $F(x_r)$, the small case x
indicating the same value as X, but located in the population
rather than the sample. $F(x_r)$ is therefore the probability, P, of
a sample observation being less than $x_r$. The a priori probability
that in a random sample of size n, r-1 observations will fall below
$x_r$, one observation at $x_r$, and n-r observations above $x_r$ is
given by the multinomial law for partitions:

$$\frac{n!}{(r-1)!\,1!\,(n-r)!}\,[F(x_r)]^{r-1}\,[1-F(x_r)]^{n-r}\,[f(x_r)\,dx_r].$$

Substituting P for $F(x_r)$, this becomes

$$\frac{n!}{(r-1)!\,(n-r)!}\,P^{r-1}(1-P)^{n-r}\,dP.$$

This states the probability that the $r$th ordered sample
observation occupies the area of the population distribution curve
(i.e. density function) whose ordinate is $f(x_r)$ and whose base is
$dx_r$. Equivalently, it is the probability that exactly a
proportion $P = F(x_r)$ of the parent population lies below $x_r$.
By integrating from $P = \lambda$ to P = 1 we obtain the probability
that a proportion $\lambda$ or more of the parent population lies
below the $r$th smallest sample observation. Thus

$$\int_{\lambda}^{1} \frac{n!}{(r-1)!\,(n-r)!}\,P^{r-1}(1-P)^{n-r}\,dP$$

gives the desired probability. This can be evaluated by means of
tables of the incomplete beta function since

$$\int_{\lambda}^{1} \frac{n!}{(r-1)!\,(n-r)!}\,P^{r-1}(1-P)^{n-r}\,dP = 1 - \int_{0}^{\lambda} \frac{n!}{(r-1)!\,(n-r)!}\,P^{r-1}(1-P)^{n-r}\,dP$$

$$= 1 - \frac{\Gamma(n+1)}{\Gamma(r)\,\Gamma(n-r+1)} \int_{0}^{\lambda} P^{r-1}(1-P)^{n-r}\,dP = 1 - I_{\lambda}(r,\ n-r+1).$$

The probability sought is therefore $1 - I_{\lambda}(r, n-r+1)$, or
if tables of the incomplete beta function are not available,
binomial tables can be used since

$$1 - I_{\lambda}(r,\ n-r+1) = \sum_{i=0}^{r-1} \binom{n}{i} \lambda^{i} (1-\lambda)^{n-i}.$$

By obvious symmetry the probability that a proportion $\lambda$ of
the population lies below the $r$th smallest sample observation is
also the probability that a proportion $\lambda$ of the population
lies above the $r$th largest sample observation, i.e., the
$(n-r+1)$th ordered observation.

b. Assumptions. Random sampling from a continuously
distributed population. The latter assumption was implicitly
introduced in the derivation when the probability of an observation
above $x_r$ was taken as one minus the probability of an observation
below $x_r$. This leaves the probability for an observation equal to
$x_r$ to be zero, which is the case only if f(x) is continuous in
the region of $x_r$.

c. Treatment of Ties. Ties are a problem only if they
involve the $r$th ordered sample value. In this case, if the
proportion of tied observations is small, one of the following
treatments may be employed. Take a new r, r', which refers to the
middle ordered observation in the tied group to which the old $x_r$
belonged, and calculate $\lambda$ using r' and $x_{r'}$ instead of r
and $x_r$. Alternatively, calculate $\lambda$ for each of the
ordered observations tied with $x_r$ and either use the average
$\lambda$, or the most "conservative" $\lambda$.

d. Application. Decide upon the values to be used for r and
n prior to sampling. Then take a sample of n observations from
the population in question, arrange them in order of increasing
magnitude and select the $r$th ordered sample value. If it is
desired that the tolerance level is to be $1-\alpha$ that a
proportion $\lambda$ or more of the parent population lies below
$x_r$, solve

$$\int_{\lambda}^{1} \frac{n!}{(r-1)!\,(n-r)!}\,P^{r-1}(1-P)^{n-r}\,dP \ge 1-\alpha$$

for $\lambda$. This can be accomplished simply by referring to
tables of the incomplete beta function or to tables of the
cumulative binomial (See Rationale). Actually, if any three of the
values n, r, $\lambda$, $\alpha$ are preselected, the fourth can be
found by solving the above formula.
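In place of tables, the tolerance probability can be evaluated from
the cumulative binomial sum and the inequality solved numerically.
The sketch below uses simple bisection; the function names and the
tolerance argument are assumptions made for illustration.

```python
from math import comb


def prob_proportion_below(n, r, lam):
    """Probability that a proportion lam or more of the population lies
    below the r-th smallest of n observations, via the binomial sum
    over i = 0, ..., r - 1."""
    return sum(comb(n, i) * lam**i * (1 - lam) ** (n - i) for i in range(r))


def solve_lambda(n, r, alpha, tol=1e-10):
    """Largest lam whose tolerance probability is at least 1 - alpha.
    The probability falls from 1 to 0 as lam rises from 0 to 1, so
    bisection on [0, 1] is sufficient."""
    low, high = 0.0, 1.0
    while high - low > tol:
        mid = (low + high) / 2
        if prob_proportion_below(n, r, mid) >= 1 - alpha:
            low = mid
        else:
            high = mid
    return low
```

For r = n the binomial sum reduces to $1-\lambda^{n}$, so
`solve_lambda(10, 10, 0.05)` should agree with $0.05^{1/10}$, about
0.741: with tolerance level .95, at least 74 per cent of the
population lies below the largest of 10 observations.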

e. Discussion. There is an element of inaccuracy in
this method of obtaining tolerance limits. The derivation is based
on the formula

$$\frac{n!}{(r-1)!\,1!\,(n-r)!}\,[F(x_r)]^{r-1}\,[1-F(x_r)]^{n-r}\,f(x_r)\,dx_r$$

in which the "event", one observation in the region $dx_r$, is (a)
given a probability of occurrence, $f(x_r)\,dx_r$, which must be
zero since the probabilities, $F(x_r)$ and $1-F(x_r)$, for the other
two multinomial categories together equal 1, and (b) regarded as
having occurred once in n trials. The occurrence in a finite number
of trials of a predesignated event with zero probability is, of
course, implausible. The ambiguity, and inexactitude, result from
the mixture, in the same formula, of terms implying a discrete
distribution, i.e. the multinomial, with terms relating only to a
continuous distribution, i.e., $f(x_r)\,dx_r$. The net result is
inaccuracy in the order of $dx_r$, or, in more practical terms, the
distance between successive ordered observations, namely $x_r$ and
an adjacent ordered observation. The error therefore should be
between zero and the proportion of the population lying between
those two observations. Since a sample of n observations randomly
divides its population distribution into n+1 intervals each of
which, on the average, contains a proportion $\frac{1}{n+1}$ of the
population, the error in $\lambda$ would not be expected to exceed
$\frac{1}{n+1}$. See the section on confidence limits for quantiles
for a similar discussion.

f. Tables. The required probabilities can be obtained
from tables of the incomplete beta function (17, 26), by special
use of tables of the cumulative binomial (25) or, for the case where
r = 1, directly from a small table prepared by Wilks (37).

g. Sources. (11, 15, 17, 23, 24, 25, 26, 36, 37)


6.  Univariate  Tolerance  Limits 


Let a sample of n observations, capital X's, be drawn
from a continuously distributed but otherwise unknown population
f(x) and arranged in order of increasing magnitude
$X_1, X_2, \ldots, X_r, \ldots, X_s, \ldots, X_{n-1}, X_n$.
These n ordered observations divide the unknown population from
which they came into n+1 intervals: $-\infty$ to $x_1$, $x_1$ to
$x_2$, ..., $x_{r-1}$ to $x_r$, ..., $x_{s-1}$ to $x_s$, ...,
$x_{n-2}$ to $x_{n-1}$, $x_{n-1}$ to $x_n$, $x_n$ to $+\infty$,
small case x's denoting the same magnitudes as the large case X's,
but magnitudes located in the parent population, not the obtained
sample.

The probability of drawing an observation smaller than
some value $x_i$ is simply $F(x_i)$, the cumulated probability for
values of x less than $x_i$. This $F(x_i)$ is known to have a
uniform distribution from 0 to 1, so that its probability is the
same for every $x_i$, i.e., is independent of i. (See Mood (I pp.
107-108) for proof). And since the probability for $F(x_i)$ is
independent of i, the probability for $F(x_i) - F(x_{i-1})$ is
independent of i. However, this is the proportion of the parent
population within the interval $x_{i-1}$ to $x_i$. Therefore the
proportion of the parent population to be enclosed between
successive ordered sample observations is independent of the rank
of the observations.

Stated slightly differently, each of the n+1 intervals has
exactly the same probability of enclosing any given proportion of
the parent population. In the last section the probability that a
proportion $\lambda$ or more of the parent population lies below
$X_r$ was found. Since there are r intervals below $X_r$, the
derived probability is also the probability that a proportion
$\lambda$ or more of the population lies in any preselected r
intervals between successive ordered sample values. It is therefore
the probability that $\lambda$ or more of the population lies
between $X_i$ and $X_{i+r}$ if the values i and r are selected prior
to sampling.
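As an illustration of the argument, the interval from the smallest
to the largest of n observations spans n-1 of the n+1 intervals, so
the probability that it encloses a proportion $\lambda$ or more of
the population should equal the one-sided result of the preceding
section with r = n-1. The sketch below checks this against the
closed form $1 - n\lambda^{n-1} + (n-1)\lambda^{n}$, which follows
from expanding the binomial sum; the function names are
hypothetical.

```python
from math import comb


def prob_r_intervals_cover(n, r, lam):
    """Probability that any r preselected intervals between successive
    ordered observations enclose a proportion lam or more of the
    population (binomial sum over i = 0, ..., r - 1)."""
    return sum(comb(n, i) * lam**i * (1 - lam) ** (n - i) for i in range(r))


def prob_between_extremes(n, lam):
    """Closed form for the n - 1 intervals between X_1 and X_n."""
    return 1 - n * lam ** (n - 1) + (n - 1) * lam**n
```

For example, with n = 20 and $\lambda$ = .80, both
`prob_r_intervals_cover(20, 19, 0.8)` and
`prob_between_extremes(20, 0.8)` give the tolerance probability for
the sample range.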

For assumptions, application, etc., see the preceding
section. Tied observations are a problem if they include either
$X_i$ or $X_{i+r}$. If there is one such group of ties, they should
be dealt with as indicated in the preceding section. If $X_i$ and
$X_{i+r}$ are both members of tied groups, each group should be
treated separately, but in a fashion analogous to that outlined
previously. For sources, see (1, 15, 17, 22, 23, 24, 25, 26, 35, 36,
37).


7. Multivariate Tolerance Limits

Ingenious methods of setting tolerance limits for multivariate
distributions have been discussed by Wald (34), and others
(5-9, 32, 33). For the bivariate case Wald selects four integers
a, b, c, d before sampling n observations from a continuously
distributed bivariate population. After obtaining the sample, he
discards the a observations with the smallest, and the b
observations with the largest, x values; then, of the remaining
n-a-b observations, he discards the c observations with the
smallest, and the d observations with the largest, y values. The
tolerance region is the rectangle bounded by the $a$th smallest and
the $b$th largest X and by the $c$th smallest and the $d$th largest
of the n-a-b Y's between the X boundaries. Tukey (32, 33) has
generalized the method of "cuts" by which the tolerance region is
obtained and has extended the applicability of the method to
discontinuously distributed populations. Fraser (5, 6, 7) has
further developed the method so that instead of a predetermined
method of making cuts, each cut can be made in a manner determined
by the outcome of previous cuts. For details of application, see the
referenced articles.

CHAPTER XI


TESTS  BASED  ON  THE  MAXIMUM  DEVIATION  BETWEEN 
TWO  CUMULATIVE  DISTRIBUTIONS 


If  the  cumulative  distribution  for  an  obtained  sample  and 
either  (a)  the  cumulative  distribution  of  the  population  from  which 
it  was  drawn,  or  (b)  the  cumulative  distribution  for  a  second 
sample  from  the  same  population,  are  plotted  on  the  same  graph, 
the probability distribution of the maximum deviation between the
two cumulatives will be independent of the form of the sampled
population, provided that population is continuous. Because this
distribution can be derived, the maximum deviation
can be made the test statistic for distribution-free tests of
goodness  of  fit  or  tests  of  whether  two  samples  were  drawn  from 
identical  populations.  By  confining  the  test  to  the  lower  portion 
of  a  sample  cumulative,  the  test  can  be  made  especially  efficient 
for  life  testing.  Tables  of  probabilities  for  the  maximum  devia¬ 
tion  can  be  used  to  set  confidence  bands  for  the  cumulative  distri¬ 
bution  of  the  sampled  population. 
1.  Maximum  Deviation  Tests  for  Goodness  of  Fit  to  an
Hypothesized  Population 


a.  Rationale.  Let  F(x)  be  the  true  population  cumulative 
distribution  of  x  and  let  F(x)  be  plotted  as  ordinate  against  x  as 
abscissa.  Now  suppose  that  a  sample  of  n  observations  is  drawn 
from  the  x  population  and  that  the  sample  cumulative  distribution 
Sn(x) has been plotted on the same graph with F(x). Thus Sn(x)
is a step function which rises in steps of 1/n or multiples thereof.
Let d be the maximum ordinatewise deviation between the smooth
curve F(x) and the step function Sn(x). It has been proven (23, 43)
that the probability of d taking any specified value is independent of
the form of F(x) so long as the population is continuously distributed.
This can be seen as follows. The probability that an observation drawn
from the x population will be below some value x_i is simply F(x_i),
the value of the cumulative distribution at the point x_i. The
probability that exactly r observations in a sample of n observations
will lie below x_i is

$$\binom{n}{r} [F(x_i)]^{r} [1 - F(x_i)]^{n-r}.$$

And if this occurs, a proportion, r/n, of the sample has fallen below
x_i, and this is the ordinate of the sample cumulative step function,
Sn(x), at the abscissa x_i. Therefore the same expression gives the
probability that the difference in ordinates between the population
cumulative distribution and the sample cumulative step function will
be F(x_i) - r/n at the abscissa point x_i. Let F(x_i) - r/n = c.
Then F(x_i) = r/n + c and the a priori probability that
F(x_i) - Sn(x_i) = c is

$$\binom{n}{r} \left[\frac{r}{n} + c\right]^{r} \left[1 - \frac{r}{n} - c\right]^{n-r}.$$

The latter expression depends only upon c, n and r, of which the
first is a constant specified in the probability statement and the
other two are parameters of the sample, not of the population.
Therefore the probability that F(x_i) - Sn(x_i) = c is independent of
F(x), i.e., is independent of the form of the distribution of the
parent population. This is obviously true for any value of c, and
since x_i was chosen arbitrarily it is also true for any value of x.
Thus the probability that the maximum absolute deviation equals or
exceeds d, i.e., Pr (max |Sn(x) - F(x)| ≥ d), is independent of the
form of F(x) so long as the population is continuously distributed.


Therefore, if the probability of d can be derived by
assuming that x has a uniform distribution, the result can be
generalized to any continuous distribution. Following this approach,
let x have a uniform distribution with range from 0 to 1.
Then F(x) = x, and F(x) is a line of constant slope rising from
an ordinate of zero to an ordinate of 1. Now divide the population
range of xs into n equal abscissa intervals. Since F(x) is
a line of constant slope, each of the n equal abscissa intervals
contains the same proportion, 1/n, of the population. Let n_1,
n_2, ..., n_n be the obtained number of sample values falling in
the first, second, ..., n-th interval. The expected proportion of
sample values falling in any given interval is, of course, 1/n.
Therefore the a priori probability of the obtained results is given
by the multinomial and is

$$\frac{n!}{n_1!\, n_2! \cdots n_n!} \left(\frac{1}{n}\right)^{n_1} \left(\frac{1}{n}\right)^{n_2} \cdots \left(\frac{1}{n}\right)^{n_n} \quad\text{or}\quad \frac{n!}{n_1!\, n_2! \cdots n_n!} \left(\frac{1}{n}\right)^{n}.$$

This is the probability of a specified pattern of interval-occupancy,
n_1, n_2, ..., n_n. Corresponding to each pattern of interval-occupancy
is a pattern or set of ordinate differences at interval end points:
At the end of the first interval the ordinate of F(x) is 1/n and that
of Sn(x) is n_1/n; at the end of the second interval the ordinate of
F(x) is 2/n and that of Sn(x) is (n_1 + n_2)/n, etc. The probability
of the pattern of interval-occupancy is therefore equally the
probability of the set of n-1 ordinate differences. Therefore by
examining all possible patterns of interval-occupancy, selecting
those for which the corresponding set of ordinate differences
contains an ordinate difference of d or greater, and summing the
probabilities,

$$\frac{n!}{n_1!\, n_2! \cdots n_n!} \left(\frac{1}{n}\right)^{n},$$

associated with these critical d's, one obtains the probability that
at one of the abscissa points, 1/n, 2/n, ..., i/n, ..., n/n, the
ordinate difference between F(x) and Sn(x) will equal or exceed d.
See Figure 3.
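
The endpoint part of this rationale can be checked by brute force for very small n. The following Python sketch is an illustration only, not the procedure used to construct the published tables, and its function names are hypothetical. It assumes the uniform case F(x) = x, enumerates every pattern of interval-occupancy, weights each pattern by the multinomial probability, and sums the weights of the patterns showing an absolute endpoint difference of d or more.

```python
from fractions import Fraction
from math import factorial


def occupancy_patterns(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in occupancy_patterns(total - first, parts - 1):
            yield (first,) + rest


def pattern_probability(pattern):
    """Multinomial probability n!/(n_1! ... n_n!) * (1/n)^n of one pattern."""
    n = sum(pattern)
    denominator = 1
    for count in pattern:
        denominator *= factorial(count)
    return Fraction(factorial(n), denominator) * Fraction(1, len(pattern)) ** n


def endpoint_exceedance_probability(n, d):
    """Pr(|Sn(x) - F(x)| >= d at some interval endpoint i/n) when F(x) = x."""
    total = Fraction(0)
    for pattern in occupancy_patterns(n, n):
        below = 0  # sample values at or below the current endpoint
        for i, count in enumerate(pattern, start=1):
            below += count
            if abs(Fraction(below, n) - Fraction(i, n)) >= d:
                total += pattern_probability(pattern)
                break  # each critical pattern is counted once
    return total


print(endpoint_exceedance_probability(2, Fraction(1, 2)))
```

For n = 2 and d = 1/2 the script prints 1/2: of the three patterns (0, 2), (1, 1) and (2, 0), only the first and last, each of probability 1/4, are critical. The enumeration grows very quickly with n, so the sketch is useful only for checking the argument on small cases.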
[Figure 3. Functions of x: F(x) and Sn(x), compared at interval endpoints. Pr (any given set of ordinates for Sn(x) at interval endpoints) = Pr (corresponding set of n_i); the latter is $\frac{n!}{\prod n_i!} \left(\frac{1}{n}\right)^{n}$. Thus Pr (Sn(x) ≥ F(x) + d at an endpoint) = $\sum \frac{n!}{\prod n_i!} \left(\frac{1}{n}\right)^{n}$, taken over all sets of n_i in which Sn(x) ≥ F(x) + d at an endpoint.]
[Figure 4. Functions of x: F(x), the band F(x) + d, and Sn(x) within the i-th interval. Pr {Sn(x) > F(x) + d within the i-th interval, given that Sn(x) < F(x) + d at the interval endpoints} = $p_i^{n_i}$, where p_i = proportion of interval width outside of F(x) + d at the highest ordinate taken by Sn(x) within the i-th interval and n_i = number of sample units in the i-th interval.]
However, a maximum absolute deviation of d or greater can
occur within an interval without also occurring at its beginning or end.
F(x) rises an ordinate distance of 1/n from the beginning to the end of
an interval. Therefore, if Sn(x) is greater than F(x) + d - 1/n but less
than F(x) + d at the upper endpoint of an interval, Sn(x) may exceed
F(x) + d within the interval. Whether it does so or not depends upon
the abscissa at which Sn(x) rises to its greatest ordinate within the
interval. (The greatest ordinate is 1/n greater than the
next-to-greatest ordinate.) In order for Sn(x) to exceed F(x) + d, it
must assume its maximum ordinate before F(x) + d exceeds that ordinate.
Let a horizontal line be drawn across the width of the interval at
the highest ordinate taken by Sn(x) within the interval, and let K be
that part of the horizontal line which lies outside of the confidence
band, F(x) + d. In order for Sn(x) to exceed F(x) + d, its maximum
ordinate must overlap with K, and for this to happen all n_i must have
abscissae beneath K. If p_i is the proportion of the interval width
represented by K, then the probability that a randomly selected one of
the n_i will lie below K is p_i, and the probability that all n_i units
will lie below K is $p_i^{n_i}$. This is the probability that when
F(x) + d - 1/n < Sn(x) < F(x) + d at the upper endpoint of the interval,
Sn(x) > F(x) + d within the interval. See Figure 4. Similarly, if
F(x) - d < Sn(x) < F(x) - d + 1/n at the lower endpoint of the i-th
interval, let a horizontal line be drawn across the width of the
interval at the lowest within-interval ordinate of Sn(x), let L be that
part of the line lying below F(x) - d, and let p'_i be the proportion
of the interval width represented by L. The probability that
Sn(x) < F(x) - d within the interval is $p_i'^{\,n_i}$, and since K and L
cannot have any abscissae in common, $p_i^{n_i}$ and $p_i'^{\,n_i}$ are
mutually exclusive probabilities and can be added. Let Q_i be the sum
of these two probabilities. Now consider those so far uncounted
patterns of interval occupancy at all of whose endpoints
|F(x) - Sn(x)| < d and at some of whose endpoints
|F(x) - Sn(x)| > d - 1/n. The probability of each such pattern is

$$\frac{n!}{n_1!\, n_2! \cdots n_n!} \left(\frac{1}{n}\right)^{n}$$

as before. The probability that, given such a pattern of
interval-occupancy, |F(x) - Sn(x)| > d within an interval depends not
only upon the values of the Q_i, but upon the number of nonzero Q_i.
If there is only one nonzero Q_i, then the probability that
|F(x) - Sn(x)| > d will be that Q_i times

$$\frac{n!}{n_1!\, n_2! \cdots n_n!} \left(\frac{1}{n}\right)^{n}.$$

If there are two nonzero Q_i, Q_1 and Q_2 say, then

$$\Pr\left(|F(x) - S_n(x)| > d\right) = (Q_1 + Q_2 - Q_1 Q_2)\, \frac{n!}{n_1!\, n_2! \cdots n_n!} \left(\frac{1}{n}\right)^{n},$$

since we are interested only in whether or not Sn(x) exceeds the
confidence bands, not in how often it does so within a given pattern of
interval-occupancy. The probabilities are then summed over all
critical patterns, i.e., those in which the maximum absolute deviation
at the endpoints of one of the n intervals lies between d - 1/n
and d. This sum plus the previously obtained probability that d
will be equalled or exceeded at an endpoint is the probability that
max |Sn(x) - F(x)| ≥ d at any mutual abscissa value.


Probabilities  obtained  in  this  manner  are  appropriate  for 
a  two-sided  test.  Probabilities  for  a  one-sided  test  can  be  derived 
in  analogous  fashion  by  substituting  "maximum  deviation  in  a  single 
predesignated direction", d', for "maximum absolute deviation", d,
in the above. Thus instead of Pr (max |Sn(x) - F(x)| ≥ d) one
obtains either Pr (max (Sn(x) - F(x)) ≥ d') or
Pr (max (F(x) - Sn(x)) ≥ d').


b.  Null  Hypothesis.  The  parent  population  from  which  the 
sample  was  drawn  is  identical  to,  i.e.  ,  is  completely  and  exactly 
defined  by,  the  hypothesized  population  whose  cumulative  distribu¬ 
tion  is  F(x). 


c.  Assumptions.  Sampling  is  random,  observations  are 
independent,  and  the  sampled  population  is  continuously  distributed. 




d.  Treatment  of  Ties.  Although  ties  cause  the  test  to  become 
imprecise,  they  require  no  special  modification  of  procedure.  So 
long  as  the  proportion  of  tied  observations  is  small,  the  tabled  prob¬ 
abilities  will  probably  be  very  close  approximations  to  the  true  ones. 

When n is so large that tables whose probabilities are derived
from asymptotic formulae must be used, ties cause the probability
error to be in the conservative direction. If the true probability
that max |Sn(x) - F(x)| ≥ d is α, the tabled probability will
be no smaller than α so rejection will occur less frequently than
would be the case if there were no ties. And if the true probability
that max |Sn(x) - F(x)| < d is 1 - α, the tabled probability will be
no greater than 1 - α and confidence limits obtained from the tables
at the nominal 1 - α level of confidence will have a true confidence
level equalling or exceeding that level (12, 22).

e.  Efficiency.  Van  der  Waerden  (42)  compared  the  power  of 
the  unidirectional  maximum  deviation  test  (at  a  significance  level  of 
.01)  with  that  of  the  one-sided  most  powerful  parametric  test  when 
both  the  sampled  and  the  hypothesized  populations  were  normally 
distributed  with  variance  of  1,  differing  only  in  location.  The  uni¬ 
directional  maximum  deviation  test  was  less  powerful  than  the  class¬ 
ical  test  with  the  power  discrepancy  increasing  as  sample  size  in¬ 
creased  from  2  to  3  to  5.  At  n  =  5  its  efficiency  had  dropped  to  about 
.65. Massey (34) compared the smallest maximum absolute deviations
detectable with probability .50 by the d test and by the chi-square
test for αs of .05 and .01 and ns ranging from 200 to 2000. The
d  test  was  found  to  be  superior  to  chi  square  in  all  of  the  46  cases 
examined. 

Massey  (30,  31)  has  found  the  maximum  absolute  deviation 
test  to  be  consistent  provided  that  the  sampled  population  is  contin¬ 
uously  distributed,  but  biassed  for  finite  n.  He  has  also  obtained 
a  lower  bound  for  its  power.  Birnbaum  (5)  has  found  bounds  for 
the  power  of  the  one-sided,  i.  e.  ,  maximum  unidirectional  deviation, 
test. 


f. Application. Plot the cumulative distribution of the hypothesized
population and the cumulative distribution, i.e., step function,
of the sample on the same graph as shown in Figure 3. Find
the maximum ordinatewise deviation, d, between the two cumulative
distributions. Enter the probability tables with d and n to determine
the significance of the result.
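
The graphical step can also be carried out numerically. The sketch below is one possible way of doing so and is not taken from the report; the function names and the uniform hypothesized population are assumptions made for illustration. Because Sn(x) changes only at sample values while F(x) never decreases, the largest deviation must occur either just before or just at one of the ordered observations, so only those 2n ordinate differences need to be examined.

```python
def max_abs_deviation(sample, cdf):
    """Maximum ordinatewise deviation d between Sn(x) and a hypothesized F(x)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Sn is (i - 1)/n just below the i-th ordered value and i/n at it.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def uniform_cdf(x):
    """Hypothesized population for the example: uniform on the range 0 to 1."""
    return min(1.0, max(0.0, x))


d = max_abs_deviation([0.7, 0.1, 0.4], uniform_cdf)
print(round(d, 4))
```

For the three illustrative observations the largest difference occurs at the largest value, where Sn(x) = 1 and F(x) = 0.7, so d is 0.3 up to rounding. The resulting d and n are then taken to the tables as described above.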

g. Discussion. Much of the literature on maximum absolute
deviation methods relates to the setting of confidence bands for an
hypothesized population. Thus if Pr (max |Sn(x) - F(x)| ≥ d) = α,
there will be a confidence level of 1 - α that Sn(x) will stay entirely
within the band between two curves whose ordinates are F(x) + d and
F(x) - d. Or if Pr (max (Sn(x) - F(x)) ≥ d') = α there is a probability
of 1 - α that Sn(x) will never reach or exceed F(x) + d'. It
is to be noted that d' at the level α is not identical to d at the level
2α although when n < 100 and α < .05 they are approximately
equal (35).
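
The same inequality, read with Sn(x) fixed and F(x) unknown, gives a band around the sample step function. The following sketch shows how such a band might be computed at the observed values; the function name is hypothetical, and the half-width d passed to it is assumed to have been taken from a table for the chosen n and α. The value 0.3 used below is purely illustrative and is not a tabled critical value.

```python
def confidence_band(sample, d):
    """Return (x, lower, upper) at each ordered observation for the band Sn(x) -/+ d."""
    xs = sorted(sample)
    n = len(xs)
    band = []
    for i, x in enumerate(xs, start=1):
        s = i / n  # ordinate of Sn(x) at the i-th ordered observation
        # A cumulative distribution cannot leave the range 0 to 1.
        band.append((x, max(0.0, s - d), min(1.0, s + d)))
    return band


for x, lower, upper in confidence_band([3.0, 1.0, 2.0, 4.0], 0.3):
    print(x, round(lower, 2), round(upper, 2))
```

Between observations the band is constant, since Sn(x) is a step function, so the values at the ordered observations describe it completely.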

The derivation given under "Rationale" was chosen for
its conceptual simplicity. The method outlined is not the most
efficient means of obtaining probabilities. Probabilities for the
maximum absolute deviation have generally been obtained by means of
recursion formulae. However, probabilities for the maximum
unidirectional deviation can be obtained by use of a single exact
formula derived by Birnbaum and Tingey (7).
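
The report does not reproduce the exact one-sided formula. The sketch below uses the closed form commonly attributed to Birnbaum and Tingey, stated here from general knowledge rather than from this report, so it should be checked against reference (7) before being relied upon:

$$\Pr(d' \ge h) = h \sum_{j=0}^{\lfloor n(1-h) \rfloor} \binom{n}{j} \left(h + \frac{j}{n}\right)^{j-1} \left(1 - h - \frac{j}{n}\right)^{n-j}.$$

The function name and the handling of the boundary cases are assumptions of this illustration.

```python
from math import comb, floor


def one_sided_tail_probability(n, h):
    """Pr(max (Sn(x) - F(x)) >= h) for a continuous population, assumed closed form."""
    if h <= 0:
        return 1.0
    if h > 1:
        return 0.0
    total = 0.0
    for j in range(floor(n * (1 - h)) + 1):
        # Guard against a tiny negative base caused by floating-point rounding.
        remainder = max(0.0, 1 - h - j / n)
        total += comb(n, j) * (h + j / n) ** (j - 1) * remainder ** (n - j)
    return h * total


print(one_sided_tail_probability(1, 0.3))
print(one_sided_tail_probability(2, 0.5))
```

Two small cases can be verified by hand. For n = 1 the deviation is 1 - F(x_1), so the probability of reaching 0.3 is 0.7. For n = 2 a deviation of 0.5 is reached only when both observations fall in the lower half of the population, which has probability 0.25.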

The relative merits of chi square and the maximum absolute
deviation test have been discussed by a number of authors (4,
20, 34). The d test is superior to chi square in the following ways.
The  d  test  requires  only  the  assumption  of  a  continuously  distri¬ 
buted  population  (other  than  the  usual  assumptions  of  randomness 
and  independence)  while  chi-square  requires,  among  others,  the 
assumption that observed frequencies are normally distributed
about their expected frequencies; thus the d test is distribution-free
for all sample sizes while chi-square becomes distribution-free
only when an infinite-sized sample permits the normality
assumption  to  be  fulfilled.  The  exact  distribution  of  d  is  known 
and  tabled  for  small  sample  sizes,  while  the  exact  distribution  of 
chi-square is known and tabled only for infinite-sized samples. The
d' test can be used to test for deviations in a given direction, i.e.,
can be used as a one-sided test, while chi-square cannot. The d
test uses ungrouped data, every observation representing a point
at which the "goodness of fit" is examined; chi-square loses this
information  by  requiring  that  data  be  grouped  into  cells.  Further¬ 
more  by  using  ungrouped  data  the  d  test  avoids  the  hazards  and 
pitfalls  associated  with  choice  of  interval  size  and  selection  of 
starting  point  in  chi-square  tests  of  fit  and  no  correction  for  con¬ 
tinuity  is  required  by  the  d  test.  The  d  test  can  be  applied  to  data 
which  become  available  sequentially  from  smallest  to  largest,  com¬ 
putations  being  continued  only  up  to  the  point  at  which  rejection 
occurs; it thus has an "efficiency" aspect not present in chi-square.
Confidence bands can be easily established on the basis of the
distribution of d, while chi-square has no such analogous property.

More  is  known  about  the  power  of  the  d  test  than  of  chi  square,  and 
the  information  presently  available  suggests  that  in  general  it  is 
the  more  powerful  test.  Chi  square,  on  the  other  hand,  is  superior 
to  d  in  the  following  ways.  Chi  square  does  not  require  that  the  hy¬ 
pothesized  population  be  completely  known  in  advance  of  sampling. 
Certain  population  parameters  can  be  estimated  from  the  sample 
and the resulting degree of "artificial" fit between obtained sample
and hypothesized population can be taken account of and prevented
from  biassing  the  probability  of  significance  by  making  the  appro¬ 
priate  reduction  in  degrees  of  freedom.  No  such  adjustment  is 
possible  with  the  d  test,  which  requires  that  the  hypothesized  pop¬ 
ulation  be  completely  known  and  specified  a  priori.  Chi  square 
can  be  partitioned  and  added,  very  useful  properties  which  the  d 
statistic  does  not  possess.  Finally,  chi  square  can  be  applied  to 
discrete  populations.  The  d  test,  however,  is  not  incapable  of  such 
applications.  When  the  assumption  of  continuity  is  not  met,  the 
probability  of  d  is  expressed  by  an  inequality  rather  than  an  equation. 
The result is that the true probability that d ≥ h (or that d' ≥ h') is no
greater than the tabled probability. Therefore in tests of significance
the true probability of rejection may be smaller, but not greater,
than the nominal probability, α. And in setting confidence limits,
the true probability of inclusion within the limits may be greater, but
not smaller, than the nominal probability of inclusion, 1 - α. In
both cases the probability error is a "conservative" one. See (12,
20, 22).


h. Tables. Critical values of d at standard significance
levels have been tabled by Miller (35) for all values of n from 1 to
100, approximate formulae having been used. A smaller table has
been published by Massey (34). Probabilities that d will be less
than c/n have been tabled by Birnbaum (4) for all values of n from
1 to 100, and Massey (29) has published a less extensive table. The
limiting distribution of d or its equivalent has been given by a
number of authors. Massey gives the values of d required for
significance at standard significance levels when n is infinite (34) and
the probability, at n = ∞, that d < λ/√n for various values of λ
(29). The latter probability has been tabled, in terms of λ rather
than d, by Kolmogorov (22, 23). The limiting distribution of λ
has been tabled by Smirnov (39).
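
Where the limiting tables are not at hand, the limiting probability can be evaluated directly. The sketch below assumes the usual series form of Kolmogorov's limiting distribution, which is quoted from general knowledge and is not given in this report:

$$\lim_{n \to \infty} \Pr\left(d < \lambda/\sqrt{n}\right) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\lambda^2}.$$

The function name and the fixed number of terms are choices made for this illustration.

```python
from math import exp, sqrt


def kolmogorov_limiting_probability(lam, terms=100):
    """Limiting Pr(d < lam / sqrt(n)) as n grows, from the assumed series form."""
    if lam <= 0:
        return 0.0
    tail = 0.0
    for k in range(1, terms + 1):
        tail += (-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
    return 1.0 - 2.0 * tail


print(round(kolmogorov_limiting_probability(1.36), 4))

# Approximate two-sided critical d for a large sample, taking lam = 1.36.
n = 400
print(round(1.36 / sqrt(n), 4))
```

At λ = 1.36 the series gives a probability of about .95, so 1.36/√n serves as an approximate critical value of d at the .05 level for large n. The series converges quickly for moderate λ, but more terms would be needed for λ close to zero.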

Critical values of the maximum unidirectional deviation
between sample step function and population cumulative distribution
have been tabled by Miller (35) for all values of n from 1 to 100,
standard significance levels being used. His probabilities are
based on asymptotic formulae when n exceeds 20. A smaller table
of such values has been published by Birnbaum and Tingey (7).

i.  Sources.  3-12,  14-15,  17,  19-23,  26,  29-31,  34-37, 
39,  42-44. 


2.  Related  Tests  of  Fit 


A statistic somewhat similar to that outlined in 1, Maximum
Deviation Tests for Goodness of Fit to an Hypothesized Population,
has been considered by a number of writers. It is

$$n\omega_n^2 = n \int_{-\infty}^{\infty} \left(S_n(x) - F(x)\right)^2 \, dF(x),$$

which can be equivalently expressed as

$$n\omega_n^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[\frac{2i-1}{2n} - F(x_i)\right]^2,$$

where x_1, x_2, ..., x_n are the sample values arranged in order of
increasing size. The statistic nω_n² is distribution free, and
requires only the assumptions of random
and independent sampling from a continuously distributed population.
Its probabilities have been tabled for samples of size 1, 2 and 3
(28) and of size n = infinity (1, 28).
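
The computing form of this statistic is easily programmed. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the hypothesized population is taken to be uniform on the range 0 to 1, and the sample values are invented for the example.

```python
def n_omega_squared(sample, cdf):
    """Compute 1/(12n) + sum over i of [(2i - 1)/(2n) - F(x_i)]^2 for ordered x_i."""
    xs = sorted(sample)
    n = len(xs)
    total = 1.0 / (12 * n)
    for i, x in enumerate(xs, start=1):
        total += ((2 * i - 1) / (2 * n) - cdf(x)) ** 2
    return total


def uniform_cdf(x):
    """Hypothesized population for the example: uniform on the range 0 to 1."""
    return min(1.0, max(0.0, x))


print(n_omega_squared([0.75, 0.25], uniform_cdf))
```

With observations at 0.25 and 0.75 every bracketed term vanishes, leaving only 1/(12n) = 1/24, the smallest value the statistic can take for n = 2.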


Anderson and Darling (1, 2) have proposed a modification
of the above statistic which involves the application of a weight
function to (Sn(x) - F(x))². They have also proposed (1) to modify
the maximum absolute deviation test, described in 1, Maximum
Deviation Tests for Goodness of Fit to an Hypothesized Population,
by applying a weight function to |Sn(x) - F(x)|. These and
related tests are discussed in (1, 2, 3, 11, 28, 38).
3.  Truncated  Maximum  Deviation  Tests  for  Identical
Populations


a. Rationale. Suppose that two samples, one of n observations
labeled xs and the other of m observations labeled ys, have
been drawn from continuously distributed populations and that the
experimenter wishes to test whether or not the sampled populations
are identical. Let Sn(x) and Sm(y) be the cumulative step functions
of the x and y samples respectively, and let them be plotted on the
same graph. Finally let d_r be the maximum difference in ordinates
between the two step functions at any abscissa value less than or
equal to the r-th x observation in order of ascending size. The
probability that d_r equals or exceeds some predesignated value, h, has
been tabled and can be used to test the hypothesis of identical
populations.


Let the x observations be arranged in order of increasing
size, x_1, x_2, ..., x_i, ..., x_r, ..., x_n, and let the number of y
observations smaller than x_1 be designated m_1, the number of y
observations between x_1 and x_2 be m_2, etc., so that m_i is the
number of sample ys between x_{i-1} and x_i, and let the number of y
observations greater than x_r be represented by M. The a priori
probability for any set of such frequencies, m_1, m_2, ..., m_r, M, is

$$\binom{M+n-r}{M} \Big/ \binom{m+n}{m}.$$

This can be proved as follows: If the two samples

are  from  the  same  population,  the  sample  designations  x  and  y  are 
arbitrary.  The  set  m^^,  m^,  .  .  .  ,  m^,  M  may  then  be  regarded  as 

having  been  obtained  by  drawing  labels,  without  replacement,  from 
a  population  consisting  of  n  x  labels  and  m  y  labels  and  applying 
them,  in  the  order  drawn,  to  the  m  +  n  observations  arranged  in 
order of increasing size. The first m_1 labels must be ys, the next
must be an x, then m_2 ys in succession followed by another x, etc.


The  probability  of  drawing  the  required  label  on  any  given  draw  is, 
of  course,  the  remaining  number  of  labels  of  the  required  type 
divided  by  the  remaining  number  of  labels  of  both  types.  Thus 
the  denominator  of  the  probability  fraction  is  m  +  n  on  the  first 
draw, m + n - 1 on the second, and 1 on the last, and the product of
these denominators is simply (m+n)!. The numerators will be

$$[m(m-1)(m-2)\cdots(m-m_1+1)]\,[n]\,[(m-m_1)(m-m_1-1)\cdots(m-m_1-m_2+1)]\,[n-1]\cdots \text{etc.},$$

or, more concisely,

$$\frac{m!}{(m-m_1)!}\; n\; \frac{(m-m_1)!}{(m-m_1-m_2)!}\; (n-1)\; \frac{(m-m_1-m_2)!}{(m-m_1-m_2-m_3)!}\; (n-2) \cdots \frac{(m_r+M)!}{M!}\; (n-r+1)\,(n-r+M)!,$$

which, after making the obvious cancellations, reduces to

$$\frac{m!}{M!}\,\frac{n!}{(n-r)!}\,(n-r+M)! \quad\text{or}\quad m!\, n!\, \binom{n-r+M}{M}.$$

Dividing this numerator by the denominator (m+n)!, the resulting
probability fraction is

$$\binom{n-r+M}{M} \Big/ \binom{m+n}{m}.$$

This is the probability that if observations from a sample of n xs
and m ys are arranged in order of increasing size, m_1 ys will be less
than x_1, m_2 ys will lie between x_1 and x_2, etc., m_r ys will lie
between x_{r-1} and x_r, and M ys will lie above x_r. Obviously it is
also the probability that m_1 ys will lie below x_1, m_1 + m_2 ys will
lie below x_2, etc., and m_1 + m_2 + ... + m_r ys will lie below x_r.
And therefore it is the probability that at the abscissae x_1, x_2,
x_3, ..., x_r the ordinates of the step function Sm(y) will be

$$\frac{m_1}{m},\ \frac{m_1+m_2}{m},\ \frac{m_1+m_2+m_3}{m},\ \ldots,\ \frac{m_1+m_2+m_3+\cdots+m_r}{m}.$$

At the same abscissae, the ordinates of Sn(x) jump a distance 1/n,
remaining constant at abscissae values in between. So the above
probability is also the probability that at abscissae infinitesimally
smaller than x_1, x_2, ..., x_r, the difference in ordinates between
the two step functions will be

$$\frac{0}{n}-\frac{m_1}{m},\ \frac{1}{n}-\frac{m_1+m_2}{m},\ \ldots,\ \frac{r-1}{n}-\frac{m_1+m_2+\cdots+m_r}{m},$$

and that at abscissae infinitesimally larger than x_1, x_2, ..., x_r,
the differences in ordinates will be

$$\frac{1}{n}-\frac{m_1}{m},\ \frac{2}{n}-\frac{m_1+m_2}{m},\ \ldots,\ \frac{r}{n}-\frac{m_1+m_2+\cdots+m_r}{m}.$$

The maximum absolute deviation, d_r, must be an ordinate difference
in one of these two sets since, in the interval between x_{i-1} and
x_i, the ordinate of Sn(x) remains constant while the ordinate of
Sm(y) has its lowest value at x_{i-1} and its highest value at x_i.
Thus the pattern of xs and ys, when arranged in
order of increasing size, determines the maximum ordinatewise
difference between the x and y sample step functions, and the
probability that d_r equals or exceeds h is simply the sum of the
probabilities of the arrangements of xs and ys in which d_r ≥ h.
Otherwise stated, if d_r ≥ h in K of the $\binom{m+n}{m}$
distinguishable arrangements of xs and ys, then the a priori
probability that d_r ≥ h is $K \big/ \binom{m+n}{m}$. If m and n are
both very small, K can be determined by forming all patterns of xs and
ys and counting the patterns for which d_r ≥ h. For larger values of m
and n, recursion formulae are used to determine K.
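
The direct counting procedure for very small m and n can be sketched as follows. This is an illustration only and does not reproduce the recursion formulae used for the published tables; the function names are hypothetical. Exact fractions are used so that the comparison of a deviation with h involves no rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def truncated_deviation(arrangement, n, m, r):
    """Largest |Sn - Sm| at abscissae up to and including the r-th x."""
    count_x = count_y = 0
    largest = Fraction(0)
    for label in arrangement:
        if label == "x":
            count_x += 1
        else:
            count_y += 1
        largest = max(largest, abs(Fraction(count_x, n) - Fraction(count_y, m)))
        if count_x == r:
            break  # observations beyond the r-th x are ignored
    return largest


def arrangements(n, m):
    """Yield every distinguishable ordering of n xs and m ys as a string."""
    for x_positions in combinations(range(n + m), n):
        chosen = set(x_positions)
        yield "".join("x" if i in chosen else "y" for i in range(n + m))


def tail_probability(n, m, r, h):
    """Pr(d_r >= h) = K / C(m+n, m) when all arrangements are equally likely."""
    k = sum(1 for a in arrangements(n, m)
            if truncated_deviation(a, n, m, r) >= h)
    return Fraction(k, comb(m + n, m))


print(tail_probability(2, 2, 1, Fraction(1)))
```

For n = m = 2, r = 1 and h = 1 only the arrangement yyxx yields a deviation of 1 below the first x, so K = 1 and the script prints 1/6.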

The ordinate of Sn(x) reaches a height of r/n when an abscissa
of x_r is reached. Therefore, if h is selected to be a value
greater than r/n, then at any abscissa up to and including x_r, the
ordinate of Sn(x) cannot exceed that of Sm(y) by a difference of h or
more, although the reverse may occur. Thus when h > r/n, the
test is one-sided in the sense that the null hypothesis can be
rejected only because of an excess of ys over xs in the region below x_r.
Even when h ≤ r/n, the number of xs below x_r is limited while the
number of ys in this region is not; therefore, other things being
equal, a large d_r is more likely to be the result of an excess of
ys over xs in this region than of the reverse. The result is that
the test is more likely to reject when there are too many ys below
x_r than when there are too few, the bias increasing as r decreases.

This situation can be remedied and the test made unbiassedly
two-sided by taking the maximum ordinatewise deviation below the
r-th x or the r-th y, in ascending order, whichever is the larger.
This, however, requires some modifications in derivations and
formulae. Let d'_r be the maximum absolute deviation below x_r
or y_r, whichever is larger. If x_r > y_r, then at least r ys lie below
x_r, or, otherwise stated, M can be no greater than m-r. Thus
the probability that d'_r ≥ h and that x_r > y_r is obtained by taking as
K that number of arrangements in which d_r ≥ h counted only from
those arrangements in which x_r > y_r or, equivalently, in which
M ≤ m-r. Identifying the modified K as K',

$$\Pr(d'_r \ge h,\ x_r > y_r) = K' \Big/ \binom{m+n}{m}.$$

Likewise,

$$\Pr(d'_r \ge h,\ y_r > x_r) = K'' \Big/ \binom{m+n}{m},$$

with K'' and N defined analogously to K' and M. Since x_r > y_r and
y_r > x_r are mutually exclusive events, the probability that
d'_r ≥ h, when d'_r is the maximum ordinatewise deviation occurring
below whichever of the two values x_r and y_r is the larger, is simply
the sum of the separate probabilities for these mutually
exclusive events. Thus

$$\Pr(d'_r \ge h) = \frac{K' + K''}{\binom{m+n}{m}}.$$

Some of these probabilities have also been tabled.
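
For very small samples the counts K' and K'' can again be obtained by forming every arrangement. The sketch below is illustrative and its names are hypothetical; it follows each arrangement until both the r-th x and the r-th y have appeared, and uses the label of the last observation reached to decide whether x_r or y_r is the larger. It assumes that r does not exceed either sample size.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def two_sided_counts(n, m, r, h):
    """Return (K', K''): critical arrangements with x_r > y_r and with y_r > x_r."""
    k_x_larger = k_y_larger = 0
    for x_positions in combinations(range(n + m), n):
        chosen = set(x_positions)
        count_x = count_y = 0
        largest = Fraction(0)
        last_is_x = False
        for i in range(n + m):
            last_is_x = i in chosen
            if last_is_x:
                count_x += 1
            else:
                count_y += 1
            largest = max(largest,
                          abs(Fraction(count_x, n) - Fraction(count_y, m)))
            if count_x >= r and count_y >= r:
                break  # the larger of x_r and y_r has just been reached
        if largest >= h:
            if last_is_x:
                k_x_larger += 1
            else:
                k_y_larger += 1
    return k_x_larger, k_y_larger


k1, k2 = two_sided_counts(2, 2, 1, Fraction(1))
print(k1, k2, Fraction(k1 + k2, comb(4, 2)))
```

For n = m = 2, r = 1 and h = 1 the arrangement yyxx contributes to K' and xxyy to K'', giving a probability of 2/6 = 1/3. The equality of the two counts in this symmetric case reflects the two-sidedness that the modification is intended to provide.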

b. Null Hypothesis. Each of the $\binom{m+n}{m}$ distinguishable
arrangements of xs and ys is equally likely to be the pattern
obtained when the sample observations are arranged in order of
increasing size. The null hypothesis will be true if the two samples
come from the same population. It will be false, but will be
rejected at the same level, α, as if it were true, if the two
samples have been drawn from populations which are identical at values
less than or equal to the critical value, x_r or y_r, and nonidentical at
values above it.

c.  Assumptions.  Observations  are  drawn  randomly  and 
independently  from  continuously  distributed  populations. 

d. Treatment of Ties. A relatively small number of ties
are a practical problem only if an x and a y observation are tied
for the abscissa value at which the maximum ordinatewise deviation,
$d_r$, occurs. In this case, for a conservative test, the x and y
should be slightly separated so as to give their ordinates the lesser
deviation, and the maximum deviation, $d_r$, should be redetermined
by examination of the entire graph. Or, to minimize error, break
such ties in all possible ways, find $d_r$ for each such way and obtain
its probability, then use the average of these probabilities.

e. Efficiency. Epstein (18) empirically tested the relative
efficiencies of the Wilcoxon test for unpaired observations, the
"fully two-sided" version of the present test (in which $d'_r$ is chosen
from below $\max(x_r, y_r)$), Epstein's version of the exceedances test,
and the Wald-Wolfowitz total number of runs test. The tests were
applied to two hundred pairs of samples of ten observations from
each of two populations differing in means but having normal
distributions and equal variances. The order in which the tests are
listed above is the order of their efficiency, from best to worst,
in detecting the difference between the population means.

f. Application. Forty type x and forty type y light bulbs
are placed on life test. It is decided in advance to reject the
hypothesis of identical life-expectancy populations if, by the time
the fifth bulb of each type has blown, an ordinatewise deviation of
probability < .05 has occurred. Therefore $m = n = 40$, $r = 5$, $d'_5$
is the maximum ordinatewise deviation below $x_5$ or $y_5$, whichever
is larger, and $\alpha = .05$. The bulbs blow in the following order:

Xi,  X3.  x^,  X3.  x^,  x^.  y^,  x^.  x^,  y^,  x^^.  x^^,  x^^. 

The test is halted when $x_{12}$ blows because $12 - 3 = 9$ and Tsao's
tables (40) show that for $m = n = 40$ and $r = 5$ a $d'_5 \le 8/40$ has
probability .96, so a $d'_5 \ge 9/40$ is significant at less than the
.05 level.

g.  Discussion.  In  respect  to  its  derivation  the  present 
test  is  closely  related  to  tests  based  upon  exceedances. 

h. Tables. Tsao (40), who originated the test, has tabled
the probability that $d_r < c/m$ for certain equal sized samples between
$m = n = 3$ and $m = n = 40$ with $r$ never exceeding 10. He has also
prepared (40) a similar table for the probability that $d'_r < c/m$.

i. Sources. (18, 40).


4.  Maximum  Deviation  Tests  for  Identical  Populations 


a. Rationale. Suppose that r is set equal to n in the preceding
test. The test statistic becomes $d_n$, the maximum difference in
ordinates between $S_n(x)$ and $S_m(y)$ at any abscissa value below
$x_n$. But at the abscissa $x_n$, the ordinate of $S_n(x)$ is 1. The
deviation between the two step functions cannot be greater at
abscissae above $x_n$ than it is at $x_n$. So the criterion $d_n$ is
equivalent to using as test statistic $d$, the maximum ordinatewise
deviation between $S_n(x)$ and $S_m(y)$ at any common abscissa. In the
preceding section the probability that, at some abscissa value less
than $x_r$, $\max |S_n(x) - S_m(y)| \ge h$ was found to be

$$K \binom{M+n-r}{M} \Big/ \binom{m+n}{m},$$

where $K$ was the number of distinguishable arrangements of xs and ys
resulting in a $d_r \ge h$. Substituting n for r, the probability
becomes $K \binom{M}{M} \big/ \binom{m+n}{m}$, which reduces to
$K \big/ \binom{m+n}{m}$. This was to be expected since
$\binom{M+n-r}{M}$ is the number of arrangements of n xs and m ys in
which $m_1$ ys are below $x_1$, $m_2$ ys are between $x_1$ and $x_2$,
..., $m_r$ ys are between $x_{r-1}$ and $x_r$, and $M$ ys are above
$x_r$. When $r < n$, there are a number of ways in which $M$ ys can be
located above $x_r$, each distinguishable pattern of arrangement of
$M$ ys and $n - r$ xs constituting a different way. When $r = n$, there
is only one way in which the $M$ ys can be located above $x_n$. The
distribution of the ys among the xs is completely specified, so the
specification can be met by only one of the $\binom{m+n}{m}$
distinguishable patterns of arrangement of xs and ys. The maximum
absolute deviation test, $d$, is therefore a special case of the
truncated maximum absolute deviation test, $d_r$, described in the
preceding section. The test can, of course, be made one-sided by
substituting $d^*$, the maximum unidirectional deviation, for $d$ and
$K^*$, the number of the $\binom{m+n}{m}$ arrangements in which the
maximum unidirectional deviation equals or exceeds a specified value,
$h^*$, for $K$. Thus,

$$\Pr(\max |S_n(x) - S_m(y)|^* \ge h^*) = K^* \Big/ \binom{m+n}{m}$$

for a one-sided test; and for a two-sided test

$$\Pr(\max |S_n(x) - S_m(y)| \ge h) = K \Big/ \binom{m+n}{m}.$$


b. Null Hypothesis. Each of the $\binom{m+n}{m}$ distinguishable
arrangements of xs and ys is equally likely to be the pattern obtained
when the sample xs and ys are arranged in order of increasing size.
This will be the case if the two samples are drawn from the same
population.


c.  Assumptions.  See  3,  Truncated  Maximum  Deviation 
Tests  for  Identical  Populations. 

d.  Treatment  of  Ties.  See  3.  When  m  and  n  are  both  so 
large  that  tables  whose  probabilities  are  derived  from  asymptotic 
formulae  must  be  used,  ties  cause  the  probability  error  to  be  in  the 
conservative  direction.  The  tabled  probability  will  be  no  smaller 
than  the  true  probability,  so  rejection  will  occur  less  frequently  than 
would  be  the  case  if  there  were  no  ties  (12,  22). 

e. Efficiency. Applied to samples of size 5 and infinity
from normal populations with equal variances but different means,
the maximum absolute deviation test has an efficiency of .65
relative to Student's t-test, for both one-sided and two-sided tests.
In the same situation, but with samples of sizes 5 and 6, the test is
more efficient than the total number of runs test, but less efficient
than the Mann-Whitney test or the X-test (41). Applied to equal-sized
samples of size 3, 4, or 5 from normal populations with equal
variances and different means, the maximum absolute deviation
test was more efficient than Westenberg's median test and less
efficient than the Mann-Whitney test (13). It is more efficient
than the total number of runs test and less efficient than the
Mann-Whitney test when applied to large samples against the
nonparametric alternatives investigated by Lehmann (25). (See
Introduction).


The test has been proved consistent by Massey (30)
provided only that the sampled populations are continuously
distributed. See also (24). However, the test is biased for finite
n (24, 30, 31).

f.  Application.  Let  the  sample  data  be  represented  by 
the  following  table,  observations  being  listed  in  increasing  order 
of  size. 
               Corresponding                     Corresponding      Difference
x-observation  ordinate of Sn(x)  y-observation  ordinate of Sm(y)  in ordinates

    -512            1/16                                                1/16
    -509            2/16                                                2/16
    -487            3/16                                                3/16
                                      -422            1/4               1/16
                                      -415            2/4               5/16
    -409            4/16                                                4/16
                                      -398            3/4               8/16
                                      -360            4/4              12/16
    -341            5/16                                               11/16
    -312            6/16                                               10/16
    -275            7/16                                                9/16
    -202            8/16                                                8/16
    -111            9/16                                                7/16
     -58           10/16                                                6/16
     -14           11/16                                                5/16
       9           12/16                                                4/16
      21           13/16                                                3/16
      75           14/16                                                2/16
     156           15/16                                                1/16
     201           16/16                                                0/16


The maximum ordinatewise deviation is 12/16 which, for samples of
sizes 4 and 16, is found by using Massey's tables (32) to have a
probability of .034 of being equalled or exceeded. Therefore the
hypothesis of a common population would be rejected if a significance
level of .05 were used.
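
The tabled probability can be checked by direct enumeration, since
under the null hypothesis each of the distinguishable arrangements of
16 xs and 4 ys is equally likely. The following Python sketch is an
illustration only: the function names are hypothetical, and the
approach (brute-force counting rather than Massey's recursion) is
chosen for transparency, not efficiency.

```python
from fractions import Fraction
from itertools import combinations


def max_deviation(labels, n, m):
    """Maximum |Sn(x) - Sm(y)| over the ordered sequence of 'x'/'y' labels."""
    i = j = 0
    d = Fraction(0)
    for lab in labels:
        if lab == "x":
            i += 1
        else:
            j += 1
        d = max(d, abs(Fraction(i, n) - Fraction(j, m)))
    return d


def exact_tail_probability(n, m, d_obs):
    """Proportion of arrangements of n xs and m ys with deviation >= d_obs."""
    total = hits = 0
    for y_pos in combinations(range(n + m), m):
        ys = set(y_pos)
        labels = ["y" if k in ys else "x" for k in range(n + m)]
        total += 1
        if max_deviation(labels, n, m) >= d_obs:
            hits += 1
    return Fraction(hits, total)


# Data from the application above.
x = [-512, -509, -487, -409, -341, -312, -275, -202,
     -111, -58, -14, 9, 21, 75, 156, 201]
y = [-422, -415, -398, -360]

merged = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
labels = [lab for _, lab in merged]

d = max_deviation(labels, len(x), len(y))
p = exact_tail_probability(len(x), len(y), d)
print(d, p, round(float(p), 3))
```

For these data the enumeration gives d = 3/4 (that is, 12/16) and a
probability of 164/4845, or about .034, in agreement with the tabled
value.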
g. Discussion. Drion (16) has derived exact probabilities,
without resort to recursion formulae, by use of random walk methods.
His formulae, however, require samples to be of equal size. He finds
the probability that $d \ge c/n$ to be

$$\frac{2}{\binom{2n}{n}} \left[ \binom{2n}{n-c} - \binom{2n}{n-2c} + \binom{2n}{n-3c} - \binom{2n}{n-4c} + \cdots \right],$$

"the series being continued as long as $n - kc \ge 0$". The sample
sizes are, of course, n and m = n. He also obtains the probability
that the maximum unidirectional deviation, $S_n(x) - S_n(y)$, exceeds
$c/n$.

Drion has also used random walk to investigate the probability
that, disregarding the endpoints whose ordinates are zero
and one, two sample step functions will not intersect, i.e., that
either one of the sample step functions will lie entirely above the
other. If the two samples are of equal size and come from the same
population this probability is $\frac{1}{2n-1}$. If the samples come
from the same population but are of different sizes, n and m, and if
n and m are coprime, the probability is $\frac{2}{n+m}$.

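
The equal-sample-size results quoted above lend themselves to a quick
numerical check. The Python sketch below is illustrative only (the
function names are hypothetical). It assumes the alternating series
for Pr(d >= c/n) has terms C(2n, n - kc), k = 1, 2, ..., multiplied by
2/C(2n, n), and that the non-intersection probability for equal
samples is 1/(2n - 1); both are compared with brute-force enumeration
for small n.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def series_two_sided(n, c):
    """Pr(d >= c/n) for two samples of size n, by the alternating series."""
    if c < 1:
        raise ValueError("c must be a positive integer")
    total, k = 0, 1
    while n - k * c >= 0:
        total += (-1) ** (k + 1) * comb(2 * n, n - k * c)
        k += 1
    return Fraction(2 * total, comb(2 * n, n))


def brute_two_sided(n, c):
    """Same probability by enumerating every arrangement of n xs and n ys."""
    hits = 0
    for x_pos in combinations(range(2 * n), n):
        xs = set(x_pos)
        diff = worst = 0
        for k in range(2 * n):
            diff += 1 if k in xs else -1   # n * (Sn(x) - Sn(y)) so far
            worst = max(worst, abs(diff))
        hits += worst >= c
    return Fraction(hits, comb(2 * n, n))


def brute_no_intersection(n):
    """Proportion of arrangements in which the step functions never meet
    between the endpoints (the final point, where both equal 1, is skipped)."""
    hits = 0
    for x_pos in combinations(range(2 * n), n):
        xs = set(x_pos)
        diff, separate = 0, True
        for k in range(2 * n - 1):
            diff += 1 if k in xs else -1
            if diff == 0:
                separate = False
                break
        hits += separate
    return Fraction(hits, comb(2 * n, n))


for n in range(1, 6):
    for c in range(1, n + 1):
        assert series_two_sided(n, c) == brute_two_sided(n, c)
    assert brute_no_intersection(n) == Fraction(1, 2 * n - 1)
print("series and enumeration agree for n = 1..5")
```

Exact rational arithmetic is used so that agreement is tested without
rounding error; the enumeration is practical only for small n.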


h. Tables. Massey has tabled the probability that d will
not exceed specified values for equal sized samples with
$1 \le m = n \le 40$ (33), for equal or unequal sized samples with
$m \le 10$ and $n \le 10$, and for samples of selected larger sizes
(32). A small table for use with equal-sized samples has been
published by Drion (16).

The limiting cumulative distribution of $d\sqrt{\frac{mn}{m+n}}$ has
been tabled by Smirnov (39), thereby permitting the approximate
probability of a given d to be obtained when m and n are both very
large. Goodman (20) has published a table of probabilities for the
maximum unidirectional deviation between ordinates of step functions
of equal-sized samples for sample sizes ranging from 1 to 50.


i.  Sources.  8-16,  19-22,  24-25,  30-33,  36,  39,  41. 
5. A Large Sample Test Using Grouped Data


Marshall (27) has proposed an approximate test, for use
when m and n are large, which involves grouping the data into class
intervals. The range of the variables is divided into j + 1 intervals
by the selection of j arbitrary points, and the unidirectional
difference in step function ordinates is measured at each of these
points. These differences are then summed. The sum, S, is normally
distributed in the limit with mean zero and variance

$$\left(\frac{1}{m} + \frac{1}{n}\right) \left( \sum_{i=1}^{j} P_i Q_i + 2 \sum_{i=1}^{j-1} \sum_{k=i+1}^{j} P_i Q_k \right).$$

The values of $P_i$, however, must be obtained from their maximum
likelihood estimates,

$$\hat{P}_i = \frac{n S_n(i) + m S_m(i)}{m+n},$$

where $S_n(i)$ and $S_m(i)$ are the ordinates of the two step functions
at the abscissa point i which is one of the j arbitrarily chosen
points dividing the data into intervals. The test is conducted by
referring the critical ratio to normal tables. Its asymptotic power
efficiency is .64 for j = 1, .91 for j = 5, and .94 for j = 10 when
used to test for a difference in means between two normally
distributed populations with equal variances.
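
The computation can be sketched in Python as follows. This is a
hedged illustration, not Marshall's own procedure: the function names
are hypothetical, Q_i is assumed to mean 1 - P_i, the variance is
assumed to be (1/m + 1/n) times [sum of P_i Q_i plus twice the sum of
P_i Q_k over i < k], and each P_i is estimated by pooling the two
samples.

```python
import math
from bisect import bisect_right
from statistics import NormalDist


def step_ordinate(sorted_sample, point):
    """Ordinate of the sample step function at the given abscissa."""
    return bisect_right(sorted_sample, point) / len(sorted_sample)


def grouped_sum_test(x, y, points):
    """Return (S, estimated variance, critical ratio) for dividing points."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    pts = sorted(points)                       # covariance terms assume i < k
    fx = [step_ordinate(xs, t) for t in pts]
    fy = [step_ordinate(ys, t) for t in pts]
    s = sum(a - b for a, b in zip(fx, fy))     # summed unidirectional differences
    p = [(n * a + m * b) / (n + m) for a, b in zip(fx, fy)]  # pooled estimates
    q = [1 - v for v in p]                     # assumption: Q_i = 1 - P_i
    j = len(pts)
    bracket = sum(p[i] * q[i] for i in range(j))
    bracket += 2 * sum(p[i] * q[k] for i in range(j) for k in range(i + 1, j))
    var = (1 / m + 1 / n) * bracket
    z = s / math.sqrt(var) if var > 0 else 0.0
    return s, var, z


# Hypothetical data: two small samples and three arbitrary dividing points.
s, var, z = grouped_sum_test([1, 2, 3, 4, 5, 6], [3, 4, 5, 6, 7, 8], [2.5, 4.5, 6.5])
two_sided_p = 2 * (1 - NormalDist().cdf(abs(z)))
print(s, var, z, two_sided_p)
```

The test is a large-sample one, so the tiny samples here serve only to
show the arithmetic.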
CHAPTER XII


MULTI-SAMPLE  TESTS 


Distribution-free tests to detect a differential effect among
three or more treatments are often simply generalizations of an
analogous distribution-free test for the two-treatment case. In
the following chapter, the rank tests for unmatched and matched
data are generalizations of the Wilcoxon and sign tests respectively,
while the median test generalizes the two-sample test of the same
name. Most of the remaining tests are at least analogous to, if
not direct generalizations from, a two-sample distribution-free
test. However, in no case is the parallelism complete. The test
statistic may be based on essentially the same sample information,
but may take a different form; or its exact probabilities may have
been tabled only for the tiniest of sample sizes, asymptotic,
approximate formulae being employed to calculate probabilities in all
other cases. For these and other reasons, the multi-sample tests
appear to have more in common with each other than with the two-sample
test which they "generalize." They are therefore presented
together in a single chapter.
1. Rank Tests for Unmatched Data


a. Rationale. Suppose that observations have been taken
under a variety of conditions, that the observations are continuously
distributed but unmatched, and that it is desired to test whether or
not the observations recorded under the various conditions all belong
to the same population. Rank the observations from 1 to N, where
N is the total number of observations recorded under all conditions.

Now construct a table with C columns, representing conditions, and
with a number of rows equal to the greatest number of observations
recorded under a single condition. Enter each rank under the
appropriate column paying no attention to the row into which it
happens to fall. Let $n_i$ represent the number of entries, i.e., the
number of occupied cells, in the $i$th column, and let $R_i$ represent
the sum of the ranks in the $i$th column. The average rank entry in the
entire table is $(N+1)/2$, and, if the null hypothesis is true, it is
also the "population" average for the $n_i$ rank entries in the $i$th
column. The expected column sum for the $i$th column is therefore
$n_i(N+1)/2$. Let S represent the sum of the squared deviations of the
column sums from their expected values; then

$$S = \sum_{i=1}^{C} \left[ R_i - \frac{n_i (N+1)}{2} \right]^2 .$$


For a given table, N cells are occupied. There are $N!$
ways in which the ranks from 1 to N could have been assigned to
these N cells by chance, and if chance is the only determining
factor each of these ways is equally likely. Therefore, to determine
the probability of an S as great or greater than that obtained,
one need only find the number of the $N!$ tables which yield such an
S and divide by $N!$. This method however involves excessive
computation. The $n_i$ observations in the $i$th column can be permuted
in $n_i!$ ways without changing $R_i$ and therefore without affecting
the value of S. For each such permutation, the within-column entries
of every other column can be likewise permuted, so there are
$\prod_{i=1}^{C} (n_i!)$ ways of permuting within-column entries
without affecting the value of S. Therefore, one may save labor by
confining his attention to the

$$\frac{N!}{\prod_{i=1}^{C} (n_i!)}$$

tables which can be formed by permuting entries from one column to
another. If there are the same number of entries in every column,
permutations which merely interchange entire columns of entries do
not change S, and since there are $C!$ such column permutations
possible for each "table" (i.e., for any given permutation) further
labor can be saved by taking as one's population of tables only the

$$\frac{N!}{C! \prod_{i=1}^{C} (n_i!)}$$

tables whose permutations exclude permutations of entire columns and
permutations of entries within columns. In either case the cumulative
probability of a given value of S is simply the proportion of the
restricted population of "tables" which yield an S equal to or
greater than the given value. Exact probabilities have been
calculated for S and for a statistic, H, which equals

$$\frac{12}{N(N+1)} \sum_{i=1}^{C} \frac{\left[ R_i - \frac{n_i (N+1)}{2} \right]^2}{n_i}$$

and which is equivalent to S, since the $n_i$ are parameters for the
exact tables of both S and H.


The calculations required for the exact method become
unwieldy and impractical at very modest sample sizes, at which
point approximations must be relied upon. Owing to the effect
described in the Central Limit Theorem, a column mean or sum
tends to become normally distributed as the number of observations,
$n_i$, upon which it is based increases (assuming C fixed) or, perhaps
to a somewhat lesser degree, as the number, C, of different values
an observation can assume increases (assuming $n_i$ fixed).
Therefore, roughly speaking, the tendency to normality generally
increases with increasing N. If the null hypothesis is true, each
column mean comes from a population of "column means" whose
mean is $\frac{N+1}{2}$ and whose variance is

$$\frac{N^2 - 1}{12 n_i} \cdot \frac{N - n_i}{N - 1}.$$

(The variance of a distribution of means is
$\frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}$ where $\sigma^2$ is
the population variance, which in the case of sampling without
replacement from the population of integers from 1 to N is
$\frac{N^2 - 1}{12}$, and where N and n are the respective sizes of
the population and of the sample.) Therefore, if the null hypothesis
is true and if N is large enough for the $i$th column mean to have an
essentially normal distribution,

$$\frac{\dfrac{R_i}{n_i} - \dfrac{N+1}{2}}{\sqrt{\dfrac{N^2 - 1}{12 n_i} \cdot \dfrac{N - n_i}{N - 1}}}$$

is a standardized normal deviate with zero mean and unit variance.
The sum of C squared standardized normal deviates has a chi square
distribution with C degrees of freedom if the deviates are
independent. In the present case, of course, they are not
independent: if C-1 of the $R_i$ are known, the remaining one can be
obtained by subtraction from $N(N+1)/2$. However, by making
mathematical allowance for the correlation, the above approach can,
with a slight modification, be used to obtain a test statistic,

$$H = \frac{N-1}{N} \sum_{i=1}^{C} \frac{n_i \left[ \dfrac{R_i}{n_i} - \dfrac{N+1}{2} \right]^2}{(N^2 - 1)/12},$$

which is distributed approximately as chi square with C-1 degrees of
freedom. An equivalent formula, which is more efficient for
computation, is

$$H = -3(N+1) + \frac{12}{N(N+1)} \sum_{i=1}^{C} \frac{R_i^2}{n_i}.$$
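
The equivalence of the two expressions for H can be illustrated
numerically. The Python sketch below is an illustration under stated
assumptions: the function names are hypothetical, and the formulas
coded are the definitional form, (N-1)/N times the sum of
n_i [R_i/n_i - (N+1)/2]^2 divided by (N^2 - 1)/12, and the
computational form, -3(N+1) + 12/(N(N+1)) times the sum of
R_i^2/n_i. Exact fractions are used so that the comparison is free of
rounding error.

```python
from fractions import Fraction


def column_stats(columns):
    """Return N, the column sizes n_i and the column rank sums R_i."""
    N = sum(len(col) for col in columns)
    return N, [len(col) for col in columns], [sum(col) for col in columns]


def s_statistic(columns):
    """Sum of squared deviations of column rank sums from expectation."""
    N, n, R = column_stats(columns)
    return sum((R_i - Fraction(n_i * (N + 1), 2)) ** 2 for n_i, R_i in zip(n, R))


def h_definitional(columns):
    N, n, R = column_stats(columns)
    dev = sum(n_i * (Fraction(R_i, n_i) - Fraction(N + 1, 2)) ** 2
              for n_i, R_i in zip(n, R))
    return Fraction(N - 1, N) * dev / Fraction(N * N - 1, 12)


def h_computational(columns):
    N, n, R = column_stats(columns)
    return -3 * (N + 1) + Fraction(12, N * (N + 1)) * sum(
        Fraction(R_i * R_i, n_i) for n_i, R_i in zip(n, R))


# Illustrative table of ranks 1..8 in three columns of sizes 3, 2 and 3.
ranks = [[1, 2, 3], [4, 5], [6, 7, 8]]
print(s_statistic(ranks), h_definitional(ranks), h_computational(ranks))
```

For this table both forms give H = 25/4 = 6.25, and S = 225/2.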


b. Null Hypothesis. The a priori probability that a given
rank will belong to an observation in the $i$th column is $n_i/N$. This
will be the case if all N observations are members of the same
population, i.e., if conditions have not affected observations
differentially, and if all assumptions are met.


c. Assumptions. Observations have been drawn randomly
and independently from continuously distributed populations (which
would be identical if conditions had equal effects). If the
approximate test is used it must be further assumed that no $n_i$ is
very small, i.e., that in every column there are enough observations
so that, in accordance with the Central Limit Theorem, the mean of the
$n_i$ ranks will be essentially normally distributed about the grand
mean of $\frac{N+1}{2}$.


d. Treatment of Ties. When all the observations forming
a tied group lie in one column of the table, ties may be resolved
arbitrarily. In all other cases a conservative test calls for ties to
be resolved in the manner least conducive to rejection of the null
hypothesis; however, probability error will be minimized in the long
run if, instead, such ties are assigned the midrank of the tied-for
ranks. If the latter method is employed and if the large sample
(i.e., approximate) version of the test is used, the mathematical
effect of ties can be compensated for by calculating H from the
following formula:

$$H = \frac{-3(N+1) + \dfrac{12}{N(N+1)} \displaystyle\sum_{i=1}^{C} \dfrac{R_i^2}{n_i}}{1 - \dfrac{\sum T}{N^3 - N}},$$

where $T = t^3 - t$ and $t$ is the number of observations tied for the
same rank.
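
A minimal Python sketch of the midrank method with the tie correction
follows. It is illustrative only: the function name and the data are
hypothetical, and the correction coded is division of the uncorrected
H by 1 - (sum of T)/(N^3 - N), with T = t^3 - t for each group of t
tied observations.

```python
from collections import Counter
from fractions import Fraction


def h_with_tie_correction(samples):
    """H computed from midranks and divided by the tie-correction factor.

    samples: one list of raw scores per condition. Raises
    ZeroDivisionError if every observation is tied with every other.
    """
    pooled = [v for sample in samples for v in sample]
    N = len(pooled)
    counts = Counter(pooled)

    # Midrank of each distinct value: the average of the ranks it occupies.
    midrank, below = {}, 0
    for value in sorted(counts):
        t = counts[value]
        midrank[value] = Fraction(2 * below + t + 1, 2)
        below += t

    rank_sums = [sum(midrank[v] for v in sample) for sample in samples]
    h = -3 * (N + 1) + Fraction(12, N * (N + 1)) * sum(
        r * r / len(sample) for r, sample in zip(rank_sums, samples))
    t_total = sum(t ** 3 - t for t in counts.values())
    return h / (1 - Fraction(t_total, N ** 3 - N))


# Hypothetical scores: the value 2 occurs once in each of two conditions.
print(h_with_tie_correction([[1, 2], [2, 3]]))
```

When there are no ties the sum of T is zero and the function returns
the ordinary H.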


e. Efficiency. When both tests are applied to populations
having normal distributions differing only in location (and therefore
having equal variances) the H test has, relative to the F test of
analysis of variance, an asymptotic relative efficiency of $3/\pi$ or
.955. Under the same circumstances, it is more efficient than the
Brown-Mood median test which has an A.R.E. of 2/3 relative to the H
test. If, in the above, "uniform distribution" is substituted for
"normal distribution", the A.R.E. of the H test relative to the F test
becomes 1.00 and that of the median to the H test becomes 1/3. The
A.R.E. of the H test relative to the F test can exceed 1.00 for
certain types of distribution (2). However, if the distributions
have identical shapes, it cannot fall far below 1.00. The finding
of Hodges and Lehmann concerning the efficiency of the Wilcoxon
test relative to the t test applies also to the efficiency of the H
test relative to the F test: If samples are from continuous
distributions, differing only in location, the A.R.E. of the H test
relative to the F test can never fall below .864.

The H test is consistent against translation alternatives
(2). More generally, it is consistent if for some one of the C
populations the probability that a randomly selected observation from
that population exceeds a randomly selected observation from among
all C populations is some value other than 1/2 (19, 20). For example,
the test is not consistent if the C populations are symmetrical with
equal means but unequal variances, i.e., rejection of the hypothesis
of identical populations cannot be assured by taking infinite sized
samples.


f. Application. Suppose that speed of reading is to be
tested under three degrees of illumination. Nine subjects are
selected at random from a common population, and three subjects
are randomly assigned to each condition of illumination. Due to
some misadventure one subject fails to complete the experiment.
Let the data be as shown below, the first table giving the raw scores
and the second one showing their ranks.

       Condition              Condition
      A    B    C            A    B    C

     22   36   39            1    4    6
     31   37   44            2    5    7
     35        51            3         8

Calculating
$H = -3(N+1) + \frac{12}{N(N+1)} \sum_{i=1}^{C} \frac{R_i^2}{n_i}$
we obtain

$$H = -3(8+1) + \frac{12}{8(8+1)} \left[ \frac{6^2}{3} + \frac{9^2}{2} + \frac{21^2}{3} \right] = 6.25,$$

which is found, by consulting Kruskal and Wallis' tables, to have a
probability of .011. This probability could easily have been obtained
without the use of tables: There are 8 ranks and 8! ways in which
they could have been assigned to the 8 cells of the above table.
However permutations among the 3 cell entries in column A, or
the 2 in column B, or the 3 in column C are of no interest nor
are permutations of the entire set of entries under column A with
those under column C. Therefore, only $\frac{8!}{3!\,2!\,3!\,2!}$ or 280
tables need concern us. Of these 280 tables, only 3 yield values
of $\sum \frac{R_i^2}{n_i}$ as great or greater than the value actually
obtained. They are as follows:

      Condition          Condition          Condition
      A   B   C          A   B   C          A   B   C

      3   1   6          1   4   6          1   7   4
      4   2   7          2   5   7          2   8   5
      5       8          3       8          3       6

The probability of a table as extreme or more extreme than that
obtained is therefore 3/280 or .011.
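
The count of extreme tables can be reproduced by enumeration. The
following Python sketch is illustrative (function names are
hypothetical). It deals the ranks 1 to 8 into columns of sizes 3, 2
and 3 in every possible way, ignoring order within columns, and counts
the tables whose sum of R_i^2/n_i is at least the observed value. It
does not remove interchanges of columns A and C, so it works with 560
tables rather than 280; numerator and denominator are both doubled and
the probability is unchanged.

```python
from fractions import Fraction
from itertools import combinations


def sum_r2_over_n(columns):
    """Sum over columns of (rank sum squared) / (number of entries)."""
    return sum(Fraction(sum(col) ** 2, len(col)) for col in columns)


def assignments(ranks, sizes):
    """Yield every way of dealing the ranks into columns of the given sizes."""
    if not sizes:
        yield []
        return
    for first in combinations(ranks, sizes[0]):
        rest = [r for r in ranks if r not in first]
        for tail in assignments(rest, sizes[1:]):
            yield [list(first)] + tail


observed = [[1, 2, 3], [4, 5], [6, 7, 8]]
sizes = [len(col) for col in observed]
criterion = sum_r2_over_n(observed)

tables = list(assignments(list(range(1, 9)), sizes))
extreme = [t for t in tables if sum_r2_over_n(t) >= criterion]
probability = Fraction(len(extreme), len(tables))
print(len(tables), len(extreme), probability, round(float(probability), 3))
```

The enumeration finds 6 extreme tables among 560, that is 3/280 or
about .011, matching the figure obtained above.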


g. Discussion. Wilcoxon's two-sample test for unmatched
data assigned ranks to observations, irrespective of the sample to
which they belonged, then applied Fisher's method of randomization
to the rank sums of the samples. White extended the test to
samples of unequal size. The present test is a generalization of
the Wilcoxon-White test to the multi-sample case, the procedure
differing mainly in that the sum of the squared deviations of rank
sums from their expected values (or the equivalent) has, in effect,
replaced the rank sum as the test statistic. The Wilcoxon form of
the test, requiring equal-sized samples, has been generalized
by Rijkoort (29) whose test statistic is the value S, defined under
"Rationale". Kruskal and Wallis (19, 20) have generalized the
White form of the test which permits samples of unequal size;
their test statistic is H.

The Mann-Whitney form of the "Wilcoxon test" applies
to unequal as well as equal-sized samples and does not use rank
sums as the test statistic. Instead it employs the statistic, U,
which is the number of times a Y-sample observation precedes
an X-sample observation when observations from the two samples
are arranged in a single sequence in order of increasing size.
The Mann-Whitney test is a special case of the form of Kendall's
rank correlation test (tabled by Sillito) which takes exact account
of ties in one ranking. Certain multisample generalizations of
the Mann-Whitney test are mathematically equivalent to this form
of Kendall's test. Observations may be regarded as having two
characteristics: their value and the sample to which they belong.
All observations are ranked as to value, and this is the untied
ranking. If ranked according to the other characteristic, the
result is a ranking containing ties, all of the observations in a
given sample being tied for that sample's rank. The rank
correlation test then tests whether or not the tied and untied
rankings are correlated, i.e., whether or not value ranks are
systematically related to sample-category ranks (roughly, it tests
whether or not observation values are systematically related to their
sample categories).

In generalizations of the Wilcoxon and White forms of the
"Wilcoxon" test, the alternative to the null hypothesis is simply
that samples differ. Specifically the alternative hypothesis is
that the average rank of observations in one or more unspecified
columns differs in a real nonchance way from the average rank
for the entire table. In generalizations of the Mann-Whitney
(and, therefore, "modified Kendall") form of the test, however,
the alternative hypothesis is much more specific. It states
that observation-value ranks are correlated with their sample
ranks and therefore specifies the order of arrangement of the
samples. Thus, speaking roughly, it states that observations
of intermediate size tend to lie in those samples "assigned"
intermediate ranks (and therefore represented by the middle columns)
and that either small observations tend to lie in samples with low
rank (the left hand columns) and large observations in samples
with high rank, or the reverse if "negative" correlation is
suspected. The multisample generalizations of the Wilcoxon-White
and Mann-Whitney forms are therefore quite different
in their applications. The probability tables for the two forms
are constructed for different rejection regions, the rejection
region for the Mann-Whitney form being taken so as to maximize
the probability of rejection when a specific alternative
hypothesis is true. Furthermore the test statistic for generalizations
of the Mann-Whitney test is based upon inversions rather than
rank sums and can assume a greater number of gradations of
value than can generalizations of the Wilcoxon or White tests.

Multisample generalizations of the Mann-Whitney test
have been proposed, and their exact small sample probabilities
tabled, by Terpstra (43) and by Whitney (52). Multisample tests
equivalent to Kendall's rank correlation S, with exact allowance
for ties in one ranking, have been developed by Krishna-Iyer
(18), Terpstra (45) and Jonckheere (12), exact small sample
probabilities having been tabled by the last two authors. These
tests, in effect, require that "columns" be arranged in an order
implied by the alternative hypothesis; they then test whether or
not this order bears a "chance" relationship to the rank order of
the observations in the table so constructed.

h. Tables. Exact probabilities for H have been tabled
(20, 21, 1-43) for the case of three samples, none of which
contains more than five observations, i.e., $C = 3$;
$n_1, n_2, n_3 \le 5$. (Samples not necessarily of equal size). Exact
probabilities for S have been tabled (29, 21) for the cases in which
the number of samples is 3, 4 or 5, each sample containing an equal
number of observations (2, 3, 4 or 5 in the first case, 2 or 3 in the
second, and 2 observations in the third case in which there are 5
samples).

Various approximations exist for cases not covered by
the exact tables. As indicated under Rationale, H is distributed
approximately as $\chi^2$ with C - 1 degrees of freedom, and so is


12(C-1)S 
(N+1)(n'^  - 


The chi square approximation is the easiest
to use, but it is not the best. Closer approximations are
discussed in (19, 20, 29). A nomogram for obtaining probability
levels for the H or S tests is given in (30).


i. Sources. 2, 12, 18, 19, 20, 21, 29, 30, 43, 44, 45, 50, 52, 1-43.

2.  Rank  Tests  for  Matched  Data 


a. Rationale. Suppose that each of m subjects has performed
under each of n conditions and that one desires to test
whether or not the various conditions have equal influence upon
performance. Let an m x n table be constructed with n columns,
representing conditions, and m rows, representing subjects.
Rank each subject's performance under the n conditions, assigning
a rank of 1 to the smallest score, 2 to the next smallest, etc.,
and n to the largest. Then record each rank in the appropriate
cell of the m x n table. The cell entries in each row of the
table constitute one permutation of the sequence of integers from
1 to n. There are $n!$ possible permutations for a given row.
For each such permutation, there are $n!$ ways of permuting a
second specified row, etc. Since there are m rows, there are
$(n!)^m$ different "tables" which can be obtained by permuting
cell entries within rows.


Now sum the cell entries in each column. The average
cell entry in a row is $\frac{n+1}{2}$, so the average column sum
is $m\left(\frac{n+1}{2}\right)$. From each column sum subtract
$m\left(\frac{n+1}{2}\right)$ to obtain the deviation of
the column sum from the value expected if conditions have equal
effects upon performance. Square each deviation and sum the
squared deviations. Call this sum S.


For each of the $(n!)^m$ possible tables there will be a
corresponding value of S (some tables, of course, yielding the
same S). Therefore the exact probability for an S equal to or
greater than that obtained is simply the number of the $(n!)^m$
different possible tables which yield such values of S, divided
by $(n!)^m$. (Some of the $(n!)^m$ possible tables differ only in
that entire columns are interchanged. Since there are n columns,
there are $n!$ variations of any given table which can be
effected simply by transposing columns. All of these $n!$
variations, of course, yield the same S. Therefore computations
can be considerably lessened by counting critical values of S
only from the $(n!)^{m-1}$ tables none of which can be obtained by
transposing columns of another table in the set. If an S as great
or greater than that actually obtained could have been obtained
from $N$ of the $(n!)^m$ "unrestricted" tables and from $N'$ of the
$(n!)^{m-1}$ "restricted" ones, then $N/(n!)^m$ and
$N'/(n!)^{m-1}$ are equal, and both give the exact probability
sought.)
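The counting argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `statistic_S` and `exact_probability` and the small 3 x 3 table are hypothetical, and the enumeration runs over the $(n!)^{m-1}$ "restricted" tables by holding the first row fixed at 1, 2, ..., n.

```python
from fractions import Fraction
from itertools import permutations, product


def statistic_S(rank_table):
    """Sum of squared deviations of the column rank sums from m(n+1)/2."""
    m, n = len(rank_table), len(rank_table[0])
    expected = Fraction(m * (n + 1), 2)
    column_sums = [sum(column) for column in zip(*rank_table)]
    return sum((total - expected) ** 2 for total in column_sums)


def exact_probability(rank_table):
    """P(S >= observed S), counted over the (n!)^(m-1) restricted tables."""
    m, n = len(rank_table), len(rank_table[0])
    observed = statistic_S(rank_table)
    fixed_row = tuple(range(1, n + 1))        # first row is held constant
    row_permutations = list(permutations(fixed_row))
    hits = total = 0
    for other_rows in product(row_permutations, repeat=m - 1):
        total += 1
        if statistic_S([fixed_row, *other_rows]) >= observed:
            hits += 1
    return Fraction(hits, total)


# Hypothetical table: three rows, each ranked 2, 1, 3.
table = [[2, 1, 3], [2, 1, 3], [2, 1, 3]]
S = statistic_S(table)
p = exact_probability(table)
```

For this table S is 18 and the probability is 1/36, since only one of the 36 restricted tables reaches so large an S. The enumeration grows as $(n!)^{m-1}$, so the sketch is practical only for small m and n.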

When m and n are small, exact probabilities can be
calculated in the manner indicated above. Such exact
probabilities have been tabled for S and for a statistic,
$\chi_r^2$, which equals $\frac{12S}{mn(n+1)}$ and is therefore
equivalent to S when exact tables are used.


Owing to the effect described in the Central Limit
Theorem, the distribution of column means approaches a normal
distribution as the number of rows increases, thus making
possible an approximate test when m and n exceed the values given
for them in the exact tables. The mean and variance of a single
table entry are $\frac{n+1}{2}$ and $\frac{n^2-1}{12}$
respectively. The mean of the entries is also the mean of the
column means. The variance of a mean of m observations is 1/m
times the variance of the individual observations upon which the
mean is based. Therefore the variance of a column mean is
$\frac{n^2-1}{12m}$. If $\bar{R}_j$ is the mean of the ranks in
the $j$th column, then

$$\frac{\bar{R}_j - \frac{n+1}{2}}{\sqrt{\frac{n^2-1}{12m}}}$$

is, for large values of m, approximately a standardized normal
deviate with zero mean and unit variance. The sum of the squares
of n independent standardized normal deviates is distributed as
chi square with n degrees of freedom. However, the n column means
are not independent; knowing n-1 of them, the remaining mean can
be obtained by subtracting their sum from $\frac{n(n+1)}{2}$. If
one mean is "ignored", however, the remaining n-1 means may be
regarded as practically independent. Therefore, n-1 of the column
means could be selected at random and used to calculate n-1
standardized normal deviates the sum of whose squared values
would be distributed as chi square with n-1 degrees of freedom.
This approach, however, is objectionable because the
"information" contained in the arbitrarily discarded mean is
ignored. The solution favored by Friedman (11) is to find the sum
of the squares of all n standardized normal deviates, divide by n
to obtain the average squared standardized normal deviate, then
multiply this by n-1 to obtain a simulated sum of n-1 squared
standardized normal deviates which, nevertheless, takes all n of
the deviates into account. The resulting value,

$$\chi_r^2 = \frac{n-1}{n} \sum_{j=1}^{n} \frac{\left(\bar{R}_j - \frac{n+1}{2}\right)^2}{\frac{n^2-1}{12m}} = \frac{12m}{n(n+1)} \sum_{j=1}^{n} \left(\bar{R}_j - \frac{n+1}{2}\right)^2,$$

is distributed approximately as chi square with n-1 degrees of
freedom. For computational purposes, it is easier to use the
equivalent formula

$$\chi_r^2 = \frac{12}{mn(n+1)} \sum_{j=1}^{n} R_j^2 - 3m(n+1) \quad \text{with } n-1 \text{ degrees of freedom,}$$

where $R_j$ is the sum of the ranks in the $j$th column and
$\chi_r^2$ symbolizes a modified $\chi^2$ which has approximately
the chi square distribution.
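That the "average squared deviate" form and the computational form agree can be checked numerically. The following is a minimal sketch, assuming a hypothetical 4 x 3 table of ranks; the function names are illustrative only, and exact rational arithmetic is used so that the two results can be compared without rounding.

```python
from fractions import Fraction


def chi_r_squared_from_means(rank_table):
    """(n-1)/n times the sum of squared standardized column means."""
    m, n = len(rank_table), len(rank_table[0])
    grand_mean = Fraction(n + 1, 2)
    variance_of_mean = Fraction(n * n - 1, 12 * m)
    column_means = [Fraction(sum(col), m) for col in zip(*rank_table)]
    squared_deviates = [(mean - grand_mean) ** 2 / variance_of_mean
                        for mean in column_means]
    return Fraction(n - 1, n) * sum(squared_deviates)


def chi_r_squared_from_sums(rank_table):
    """Computational form: 12/(mn(n+1)) * sum(R_j^2) - 3m(n+1)."""
    m, n = len(rank_table), len(rank_table[0])
    column_sums = [sum(col) for col in zip(*rank_table)]
    sum_of_squares = sum(total * total for total in column_sums)
    return Fraction(12, m * n * (n + 1)) * sum_of_squares - 3 * m * (n + 1)


# Hypothetical ranks for m = 4 subjects under n = 3 conditions.
ranks = [[1, 3, 2], [1, 3, 2], [1, 3, 2], [2, 3, 1]]
from_means = chi_r_squared_from_means(ranks)
from_sums = chi_r_squared_from_sums(ranks)
```

Both functions return 13/2 for this table, which is also the value of $12S/mn(n+1)$ with S = 26.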


b. Null Hypothesis. For each row, each of the $n!$
permutations of the ranks 1 to n was equally likely to be the
sequence of cell entries recorded. This will be the case if
conditions have equal influence upon scores (so that variations in a
subject's performance are random) and if all assumptions are true.
Note: the null hypothesis does not imply that the observations
in different rows come from the same population.

c. Assumptions. Observations upon a subject are randomly
selected (usually it is also assumed that subjects are randomly
selected), rows are independent, i.e., one subject's performances
are uninfluenced by the performance of any other subject, and
within a single row there are no tied ranks (thus, if "performance"
is not intrinsically in rank form, it is assumed to be continuously
distributed). If the approximate test is used, it must be further
assumed that m, the number of rows, is large enough so that, in
accordance with the Central Limit Theorem, the mean of the m
ranks in a column will be essentially normally distributed about
the grand mean of $\frac{n+1}{2}$.


d. Treatment of Ties. The conservative method of dealing
with within-row ties is to distribute the tied-for ranks to tied cells
in such a way as to minimize S. In order to minimize error in the
long run, give each of the within-row ties the average of the
tied-for ranks.


e. Efficiency. When n = 2, the present test is equivalent
to the sign test which has an asymptotic efficiency of .637 relative
to Student's t test. Therefore, when n = 2 and m is infinite, the
present test has an efficiency of .637 relative to Student's t (11).
This is presumably the lowest efficiency value assumed by the test
since the efficiency of the sign test increases with decreasing
sample size and since when n = 2 the ranks in effect designate only
"smaller" versus "larger", while, with increasing n, finer and
finer gradations of discrimination are possible. Thus, with
increasing n, ranks simulate more and more closely the gradations
of measurement characteristic of continuously distributed original
scores and efficiency should approach that of tests based on such
scores. At the other extreme, when m = 2, the test is equivalent
to the rank difference correlation test shown by Hotelling and Pabst
to have an asymptotic efficiency of .912 relative to the parametric
test for correlation. This then is the efficiency of the present test
when m = 2 and n is infinite (11). It seems reasonable to conclude,
therefore, that the present test when applied to normally distributed
original scores has, relative to parametric tests, an efficiency no
smaller than .637 and generally considerably higher. (If either
n = 2 and m is very small, or if m and n are both quite large, one
would expect an efficiency close to 1.00.)

f.  Application.  Each  of  three  subjects  performs  a  well 
learned  task  three  times,  each  time  under  the  influence  of  a  dif¬ 
ferent  drug.  Performance  is  timed  and  the  experimenter  wishes 
to  test  the  hypothesis  that  no  subject's  performance  times  were  in¬ 
fluenced  more  by  one  drug  than  by  another. 


TIME SCORES
                     Drug
Subject       I        II       III
   A         4.76     1.30     7.91
   B        14.51    10.27    35.84
   C        82.11    82.09    82.14

TIME-SCORE RANKS
                     Drug
Subject       I        II       III
   A          2        1        3
   B          2        1        3
   C          2        1        3
  SUM         6        3        9


The original scores are shown above, a second table substituting
ranks for scores. For the latter table, the average column
sum is $m\left(\frac{n+1}{2}\right)$ or 6, so the deviations of the
column sums from their mean value are 0, -3, and 3. Squared these
become 0, 9, and 9 and their sum S is 18. Consulting Kendall's
exact tables it is found that an S of 18 has a chance probability
of .028 of being equalled or exceeded. (The test is one-tailed
since S can only be positive and since very small values of S only
indicate unlikely degrees of "agreement" with the null hypothesis.)


The same result could have been calculated without resort
to tables. There are 3! ways of assigning the integers 1, 2, and 3
to the three cells in row B and for each of these permutations, there
are 3! ways of permuting the ranks in row C. Thus there
are 3! x 3! = 36 tables which can be constructed without altering
the rankings in row A. For each of these tables there is a set of
column sums and a corresponding value of S. The actually
obtained set of column sums, however, differ maximally and can be
obtained in only one way if A's ranking is held constant (any
permutation of B's or C's ranks brings the sums closer together and
reduces S). Therefore the obtained table yields the maximum value
of S which can be obtained in only one of 36 tables, and the
probability of an S equal to or greater than that obtained is 1/36
or .028. The hypothesis of equal drug effects can therefore be
rejected at beyond the .05 level of significance.
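The arithmetic of this application can be retraced with a short Python sketch. The helper name `rank_within_row` is hypothetical, and the ranking routine assumes, as the test does, that no scores are tied within a row.

```python
from math import factorial


def rank_within_row(scores):
    """Rank 1 goes to the smallest score; assumes no ties within the row."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0] * len(scores)
    for rank, position in enumerate(order, start=1):
        ranks[position] = rank
    return ranks


# Time scores from the application: rows are subjects A, B, C; columns are drugs I, II, III.
time_scores = [
    [4.76, 1.30, 7.91],
    [14.51, 10.27, 35.84],
    [82.11, 82.09, 82.14],
]

rank_rows = [rank_within_row(row) for row in time_scores]
m, n = len(rank_rows), len(rank_rows[0])
column_sums = [sum(column) for column in zip(*rank_rows)]
average_sum = m * (n + 1) / 2                 # 6 for m = n = 3
S = sum((total - average_sum) ** 2 for total in column_sums)

# Number of tables obtainable while row A is held constant: (n!)^(m-1).
tables_with_A_fixed = factorial(n) ** (m - 1)
```

The sketch gives column sums 6, 3 and 9, an S of 18, and 36 tables with row A held constant, in agreement with the figures in the text.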

g.  Discussion.  Pitman  (27)  extended  Fisher's  Method  of 
Randomization  for  matched  observations  from  the  two  treatment 
case  to  the  case  of  multiple  treatments.  The  present  test,  when 
used  as  an  exact  test,  differs  from  that  proposed  by  Pitman  only 
in  that  ranks  have  been  substituted  for  original  observations  and 
the  test  statistic  is  S  rather  than  the  F  ratio.  The  further  exten¬ 
sion  of  the  test  from  its  present  requirement  of  one  observation 
per  cell  to  the  case  where  any  cell  can  be  empty  or  contain  any 
positive  number  of  observations  has  been  discussed  by  Benard 
and  van  Elteren  (3). 

The present test is exact only when probabilities are
obtained by the Method of Randomization. When they are obtained
from the Z or $\chi^2$ distribution (see "Tables") they are
approximate. For values of m and n slightly larger than those for
which exact probabilities of S have been tabled, the approximate
probabilities obtained by using the Z tables are reasonably close
to the true values at the .05 or .01 levels of significance;
however, the .001 level of significance should be avoided. The
tails of the distribution of S are very irregular when m and n are
in this region.

It is to be noted that the test does not assume homogeneity
of rows. The "subjects" may belong to different populations and
their absolute performance scores under a given condition may differ
tremendously. Furthermore, the variability of performance under
the various conditions may differ vastly from one subject to another.
The test is not designed to detect such effects. It essays merely
to detect any systematic tendency for performance under one condition
to be superior to that under another condition. It will fail if such
a tendency exists in some rows but is balanced by an opposite
tendency in other rows. Therefore, while homogeneity of rows is not
assumed by the test, it will generally be desirable to select
subjects from the same population. If this is not done, and if such
subjects are not selected randomly so as to be "representative" of
their populations, the results of the test will, in a sense, be
peculiar to the group actually tested.

The  preceding  test  can  be  used  to  test  for  interactions  (53). 
The  method  can  be  best  explained  in  terms  of  an  example.  Suppose 
that  four  subjects  have  performed  under  each  of  three  conditions,  I, 
II  and  III,  of  one  variable  and  have  done  so  under  each  of  two  condi¬ 
tions,  A  and  B,  of  another  variable.  (The  significance  of  each  var¬ 
iable  alone  can  be  tested,  by  collapsing  data  over  the  other  variable 
and  performing  the  test  as  described  earlier.  )  It  is  desired  to  test 
whether the two variables interact. Let the data be as shown below:

                           COLUMN
BLOCK   ROW (Subject)    I       II      III
  A          1          15.4    26.9    27.8
  A          2          14.6    25.9    28.7
  A          3           8.3    14.2    12.0
  A          4           5.9    19.9    20.3
  B          1           9.2    15.1    18.7
  B          2           5.1    10.2    15.4
  B          3           4.9     8.2     6.1
  B          4          11.5    12.5    29.1

Now subtract each score in block B from the corresponding score
in block A and form the table shown below:

"A" Observations Minus "B" Observations

               COLUMN
ROW      I       II      III
 1       6.2    11.8     9.1
 2       9.5    15.7    13.3
 3       3.4     6.0     5.9
 4      -5.6     7.4    -8.8

The preceding test is then applied to this table in the usual manner
as shown below, ranks being substituted for difference-scores.

            COLUMN
ROW      I     II    III
 1       1     3      2
 2       1     3      2
 3       1     3      2
 4       2     3      1

The column sums are 5, 12, and 7, and since the average sum is
8, the squared deviations from the mean sum are 9, 16, and 1,
yielding an S of 26 which is significant at the .042 level.
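The steps of this interaction example can be sketched in Python as follows. The data are those of the two blocks above; the helper names are hypothetical, and the exact probability is obtained by enumerating the $(n!)^{m-1}$ tables with the first row held fixed rather than by consulting tables.

```python
from fractions import Fraction
from itertools import permutations, product

block_A = [[15.4, 26.9, 27.8], [14.6, 25.9, 28.7], [8.3, 14.2, 12.0], [5.9, 19.9, 20.3]]
block_B = [[9.2, 15.1, 18.7], [5.1, 10.2, 15.4], [4.9, 8.2, 6.1], [11.5, 12.5, 29.1]]


def rank_within_row(values):
    """Rank 1 goes to the smallest value; assumes no ties within the row."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0] * len(values)
    for rank, position in enumerate(order, start=1):
        ranks[position] = rank
    return ranks


def statistic_S(rank_table):
    m, n = len(rank_table), len(rank_table[0])
    expected = Fraction(m * (n + 1), 2)
    return sum((sum(col) - expected) ** 2 for col in zip(*rank_table))


# "A" minus "B" difference scores, then ranks within each row.
differences = [[a - b for a, b in zip(row_a, row_b)]
               for row_a, row_b in zip(block_A, block_B)]
ranks = [rank_within_row(row) for row in differences]
S = statistic_S(ranks)

# Exact probability of an S this large or larger, first row held fixed.
m, n = len(ranks), len(ranks[0])
first_row = tuple(range(1, n + 1))
row_permutations = list(permutations(first_row))
hits = sum(1 for rest in product(row_permutations, repeat=m - 1)
           if statistic_S([first_row, *rest]) >= S)
probability = Fraction(hits, len(row_permutations) ** (m - 1))
```

The enumeration finds 9 of the 216 restricted tables with S of 26 or more, i.e. a probability of about .042, which agrees with the tabled value quoted above.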

If there had been three blocks, A, B and C, two tables
would have been constructed, one for the A-B differences and one
for the differences $\frac{A+B}{2} - C$, or the ultimately
equivalent differences, (A + B) - 2C. The statistic
$\chi_r^2 = \frac{12S}{mn(n+1)}$ is then calculated for each table
after substituting ranks for difference-scores. The sum of these
two $\chi_r^2$'s is distributed approximately as chi-square with
2(n-1) degrees of freedom. With four blocks, A, B, C and D, three
tables would be constructed, one each for the differences A-B,
A+B-2C, and A+B+C-3D. Ranks would be substituted for difference
scores and $\chi_r^2$ would be calculated for each table. Since
each $\chi_r^2$ is approximately distributed as chi-square with
n-1 degrees of freedom, by the additive property of chi-square
their sum is distributed as chi-square with 3(n-1) degrees of
freedom.
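The pattern of differences A-B, A+B-2C, A+B+C-3D can be generated mechanically for any number of blocks. The sketch below is illustrative only: `block_contrasts` is a hypothetical helper, and the one-row blocks used to exercise it are made-up numbers, not data from the report.

```python
def block_contrasts(blocks):
    """For blocks B1, B2, ..., return the difference tables
    B1 - B2, (B1 + B2) - 2*B3, (B1 + B2 + B3) - 3*B4, and so on."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    contrasts = []
    for k in range(1, len(blocks)):
        table = []
        for i in range(rows):
            row = []
            for j in range(cols):
                earlier_total = sum(blocks[b][i][j] for b in range(k))
                row.append(earlier_total - k * blocks[k][i][j])
            table.append(row)
        contrasts.append(table)
    return contrasts


def pooled_degrees_of_freedom(number_of_blocks, n):
    """Degrees of freedom for the sum of the separate chi_r^2 values."""
    return (number_of_blocks - 1) * (n - 1)


# Hypothetical one-row blocks, used only to show the pattern of differences.
A, B, C, D = [[1, 2]], [[3, 5]], [[10, 20]], [[7, 1]]
tables = block_contrasts([A, B, C, D])
```

Each returned table would then be ranked within rows and its $\chi_r^2$ computed separately before the values are summed.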


The hypothesis tested is that, except for chance fluctuations,
each score in the $i$th row of one block differs by a constant amount
from the corresponding score in the $i$th row of another specified
block. If this is not the case then the influence of columns upon
entries of the $i$th row depends upon blocks and a column-block
interaction exists.

Since, in each row, ranks are substituted for original
observations, the method is particularly suitable when original
data are in intrinsic rank form, each row containing the ranks from
1 to n. This is, in fact, the case when each of m judges ranks each
of n things, tied ranks being disallowed. The present test will test
the judges' accuracy. Their reliability, i.e., agreement with one
another rather than with the "true" ranking, however, is also of
some interest and can be tested by means of distribution-free tests
originated by Kendall (see "Miscellaneous Distribution-Free
Tests") and others (6).

h. Tables. The exact probability that S will equal or
exceed a given value has been tabled (14, 15, 16) for the cases:
$n = 3$, $2 \le m \le 10$; $n = 4$, $2 \le m \le 6$; and $n = 5$,
$m = 3$. Analogous exact probabilities for $\chi_r^2$ have been
tabled (11, 1-43) for the cases: $n = 3$, $2 \le m \le 9$; $n = 4$,
$2 \le m \le 4$ (in the actual notation used P is substituted for
n above and n for m above).


For cases not covered by the above tables, close
approximate probabilities can be obtained by entering Fisher's Z
tables (given in 14) with degrees of freedom
$\nu_1 = n - 1 - \frac{2}{m}$ and $\nu_2 = (m-1)\nu_1$ and with

$$Z = \frac{1}{2} \log_e \frac{12S(m-1)}{m^2(n^3-n) - 12S}$$

or, somewhat more accurately, with Z corrected for continuity,

$$Z = \frac{1}{2} \log_e \frac{12(S-1)(m-1)}{m^2(n^3-n) - 12(S-3)}.$$
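The quantities needed to enter the Z tables can be computed as in the following sketch. The function name and the particular values of S, m and n are hypothetical, chosen only to lie outside the range of the exact tables; the sketch returns Z and the two degrees of freedom and does not itself look up a probability.

```python
from math import log


def fisher_z_entry(S, m, n, continuity=False):
    """Return (Z, nu1, nu2) for entering Fisher's Z tables."""
    nu1 = n - 1 - 2 / m
    nu2 = (m - 1) * nu1
    if continuity:
        # Corrected for continuity: S - 1 above, S - 3 below.
        ratio = 12 * (S - 1) * (m - 1) / (m ** 2 * (n ** 3 - n) - 12 * (S - 3))
    else:
        ratio = 12 * S * (m - 1) / (m ** 2 * (n ** 3 - n) - 12 * S)
    return 0.5 * log(ratio), nu1, nu2


# Hypothetical case: m = 12 rows, n = 4 columns, S = 300.
z_plain, nu1, nu2 = fisher_z_entry(300, 12, 4)
z_corrected, _, _ = fisher_z_entry(300, 12, 4, continuity=True)
```

The correction for continuity lowers Z slightly, so the corrected entry is the more conservative of the two.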


Using the above formula, with correction for continuity, tables
have been prepared which give values of S significant at the .05
and .01 levels for the cases $3 \le n \le 7$, m = 3, 4, 5, 6, 8,
10, 15 or 20 (10, 14, 1-43). Using the identity
$\chi_r^2 = \frac{12S}{mn(n+1)}$ these tables can be "translated"
into analogous tables for $\chi_r^2$. This, in effect, has been
done (10), the tables being expanded to cover the additional cases
m = 100 and m = infinity. A nomogram based upon still another
approximation is available in (30).


If the statistic $\chi_r^2$ is used instead of S, close
approximate probabilities can be obtained by substituting
$mn(n+1)\chi_r^2/12$ for S in one of the formulae, given above,
for Z, and then consulting the Z tables. A less close
approximation to exact probabilities can be somewhat more readily
obtained by entering the chi-square tables with n-1 degrees of
freedom and with $\chi_r^2 = \frac{12S}{mn(n+1)}$, or, corrected
for continuity,
$\chi_r^2 = \frac{12m(n-1)(S-1)}{m^2(n^3-n) + 24}$.

When Z or $\chi^2$ tables are used to obtain probabilities,
corrections for ties may be made. These are given by Kendall (14).
The simplest procedure appears to be to correct $\chi_r^2$ for ties
and then find the S corresponding to this value of $\chi_r^2$.
Corrected for ties,

$$\chi_r^2 = \frac{S}{\frac{mn(n+1)}{12} - \frac{\sum T}{n-1}} \quad \text{where} \quad T = \frac{\sum (t^3 - t)}{12},$$

t being the number of observations in a particular row which are
tied for the same rank, the summation of $(t^3 - t)$ occurring
over all tied-for ranks in that particular row and the summation
of T occurring over all rows.
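A minimal sketch of the tie correction follows, assuming that tied observations within a row have already been given the average of the tied-for ranks. The function name and the small table are hypothetical; exact fractions are used so that mid-ranks such as 1.5 cause no rounding.

```python
from collections import Counter
from fractions import Fraction


def tie_corrected_chi_r_squared(rank_table):
    """chi_r^2 = S / (mn(n+1)/12 - sum(T)/(n-1)), T = sum(t^3 - t)/12 per row."""
    m, n = len(rank_table), len(rank_table[0])
    expected = Fraction(m * (n + 1), 2)
    S = sum((sum(column) - expected) ** 2 for column in zip(*rank_table))
    total_T = Fraction(0)
    for row in rank_table:
        # t = number of observations in this row sharing one rank value.
        for t in Counter(row).values():
            total_T += Fraction(t ** 3 - t, 12)
    return S / (Fraction(m * n * (n + 1), 12) - total_T / (n - 1))


half = Fraction(1, 2)
# Hypothetical table: the first row has two observations tied for ranks 1 and 2.
tied_table = [[1 + half, 1 + half, 3], [1, 2, 3], [1, 2, 3]]
corrected = tie_corrected_chi_r_squared(tied_table)
```

For this table S is 15.5 and the corrected statistic is 62/11, slightly larger than the uncorrected 12S/mn(n+1) = 31/6. When no row contains ties every T is zero and the expression reduces to the uncorrected statistic.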

i.  Sources.  3,  6,  9,  10,  11,  13,  14,  15,  16,  27,  30,  32, 
34,  35,  40,  41,  42,  51,  53,  54. 


3. Median Tests


a. Rationale. Suppose that (continuously distributed)
observations have been taken under C experimental conditions and that
it is desired to test whether or not the conditions have equal effects.
Let n be the total number of observations and a be the number of
those observations which lie above the grand median, M, and let
$n_i$ be the number of observations taken under the $i$th
experimental condition and $a_i$ be the number of those $n_i$
observations which lie above the grand median for all observations.


                  Condition
                  1            2            ...   C-1                  C            Totals
Above M           a_1          a_2          ...   a_{C-1}              a_C          a
Not above M       n_1 - a_1    n_2 - a_2    ...   n_{C-1} - a_{C-1}    n_C - a_C    n - a
Totals            n_1          n_2          ...   n_{C-1}              n_C          n

Suppose that the conditions do have equal effects so that all
n observations are from a common, continuously distributed,
population. If P is the proportion of the population lying above
the grand median of the n observations, the a priori probability
that $a_i$ of the $n_i$ observations taken under the $i$th
condition will exceed M is
$\binom{n_i}{a_i} P^{a_i} (1-P)^{n_i - a_i}$, and the a priori
probability that under successive conditions the number of
observations above the median will be $a_1, a_2, \ldots, a_C$ is
the product
$\prod_{i=1}^{C} \binom{n_i}{a_i} P^{a_i} (1-P)^{n_i - a_i}$.
However, the value, M, used to dichotomize the data into the
frequency categories $a_i$ and $n_i - a_i$ is a sample, not a
population, median, i.e., it was determined a posteriori.
Therefore, the probability we seek is the conditional probability
of cell entries $a_1, a_2, \ldots, a_C$, given that their marginal
total is a, and this is obtained by dividing the a priori
probability
$\prod_{i=1}^{C} \binom{n_i}{a_i} P^{a_i} (1-P)^{n_i - a_i}$ by
the a priori probability of the marginal totals, which is
$\binom{n}{a} P^{a} (1-P)^{n-a}$. In the resulting fraction the
terms containing P cancel out leaving

$$\frac{\prod_{i=1}^{C} \binom{n_i}{a_i}}{\binom{n}{a}}$$

as the point probability for the obtained table. The
significance level for a given table is obtained by cumulating
the point probabilities for all tables as extreme or more so.
However, with increasing values of $n_i$ or C calculations are
likely to become prohibitively laborious. Fortunately, when
n > 20 and all $n_i$ > 5 a fairly good approximate test can be
performed by calculating

$$\frac{n(n-1)}{a(n-a)} \sum_{i=1}^{C} \frac{\left(a_i - \frac{n_i a}{n}\right)^2}{n_i}$$

which is distributed very nearly as chi-square with C-1 degrees
of freedom.
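The point probability can be written as a short function, and one useful check on the derivation is that the point probabilities of all tables sharing the same marginal totals sum to one. The sketch below assumes hypothetical names (`point_probability`, `first_rows_with_total`) and uses sample sizes of 5, 5 and 6 with 8 observations above the median purely as an illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb


def point_probability(sample_sizes, above_counts):
    """Product of C(n_i, a_i) divided by C(n, a)."""
    n, a = sum(sample_sizes), sum(above_counts)
    numerator = 1
    for size, above in zip(sample_sizes, above_counts):
        numerator *= comb(size, above)
    return Fraction(numerator, comb(n, a))


def first_rows_with_total(sample_sizes, a):
    """Every first row (a_1, ..., a_C) with 0 <= a_i <= n_i and sum a."""
    for row in product(*(range(size + 1) for size in sample_sizes)):
        if sum(row) == a:
            yield row


sizes = [5, 5, 6]
total_probability = sum(point_probability(sizes, row)
                        for row in first_rows_with_total(sizes, 8))
single = point_probability(sizes, (4, 4, 0))
```

Here `total_probability` equals 1 exactly, and the single table (4, 4, 0) has point probability 25/12870.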

b. Null Hypothesis. The probability that an observation
will be above the grand sample median, M, is independent of the
experimental condition under which the observation was taken.
This will be the case if conditions have equal effects and if all
assumptions are met.

c. Assumptions. Sampling is random, observations are
independent, and there are no tied observations, i.e., the sampled
populations are continuously distributed. If the large sample
approximation is used, then all of the assumptions of chi-square are
also introduced.

d. Treatment of Ties. Tied observations are no problem
unless they are tied with the median. In this case, if the proportion
of such ties is small, the following procedure is recommended.
Either (a) resolve all ties in the manner least conducive to rejection
of the null hypothesis, or (b) under each condition separately count
half of the observations tied with the grand median as above it, half
as below it, and treat an odd tie as outlined in (a) above.

e. Efficiency. When both tests are applied to populations
having normal distributions, differing only in location (and therefore
having equal variances) the median test has, relative to the F test
of analysis of variance, an asymptotic relative efficiency of
$2/\pi$ or .637. Under the same circumstances it has an A.R.E. of 2/3
relative to Kruskal and Wallis' H test. If, in the above, "uniform
distribution" is substituted for "normal distribution", the median
test has A.R.E. of 1/3 relative to the F test and also relative to
the H test. For other types of distribution, the A.R.E. of the median
test relative to either the F or H tests can be less than, equal to,
or greater than, 1, depending upon the particular distribution to
which applied (2).

The median test is consistent against translation
alternatives (2).

f. Application. Suppose that 16 rats have been randomly
selected from a common population and randomly divided into three
groups. Each group is administered a different drug after which
the time to run a maze is measured for each rat. The null hypothesis
is that the three drugs have equal effects upon maze running ability.
Let the data be as shown below.

Maze Running Times Under

Drug A    Drug B    Drug C
 267       269       215
 271       283       231
 285       288       233
 299       302       252
 304       306       255
                     264

The grand sample median lies between 269 and 271; therefore, in
terms of frequencies of observations above and below the median
the above table becomes:

                      A     B     C     Totals
Above Median          4     4     0       8
Not Above Median      1     1     6       8
Totals                5     5     6      16

The first rows of the tables as extreme or more so, and their
point probabilities, are given below:

 A     B     C     Point Probability = $\prod_i \binom{n_i}{a_i} / \binom{n}{a}$
 5     3     0     10/12870
 3     5     0     10/12870
 4     4     0     25/12870
 1     1     6     25/12870
 2     0     6     10/12870
 0     2     6     10/12870

The cumulative probability is 90/12870 or .007. Recalculating
the probability using the chi-square approximation, we have

$$\chi^2 = \frac{n(n-1)}{a(n-a)} \sum_{i=1}^{C} \frac{\left(a_i - \frac{n_i a}{n}\right)^2}{n_i} = \frac{16 \times 15}{8 \times 8} \left[ \frac{\left(4 - \frac{5}{16} \cdot 8\right)^2}{5} + \frac{\left(4 - \frac{5}{16} \cdot 8\right)^2}{5} + \frac{\left(0 - \frac{6}{16} \cdot 8\right)^2}{6} \right] = 9.$$

Entering the chi-square tables with C-1 = 2 degrees of freedom, a
chi-square of 9.00 is found to have a probability just slightly larger
than the .01 level of significance. Both methods give a probability
in the neighborhood of .01, but it is clear that the approximation is
not impressively close to the true value.
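Both calculations in this application can be retraced in Python. In the sketch below the six first rows are simply those listed in the text (no attempt is made to decide algorithmically which tables count as "as extreme or more so"), the variable names are hypothetical, and the chi-square tail probability uses the fact that for 2 degrees of freedom the upper tail is exp(-x/2).

```python
from fractions import Fraction
from math import comb, exp

sample_sizes = [5, 5, 6]          # rats under drugs A, B, C
observed = [4, 4, 0]              # observations above the grand median
n, a = sum(sample_sizes), sum(observed)

# First rows of the tables listed in the text as extreme or more so.
listed_rows = [(5, 3, 0), (3, 5, 0), (4, 4, 0), (1, 1, 6), (2, 0, 6), (0, 2, 6)]


def numerator(row):
    """Product of the binomial coefficients C(n_i, a_i) for one table."""
    result = 1
    for size, above in zip(sample_sizes, row):
        result *= comb(size, above)
    return result


numerators = [numerator(row) for row in listed_rows]
exact_probability = Fraction(sum(numerators), comb(n, a))

# Chi-square approximation: n(n-1)/(a(n-a)) * sum((a_i - n_i*a/n)^2 / n_i).
chi_square = Fraction(n * (n - 1), a * (n - a)) * sum(
    (above - Fraction(size * a, n)) ** 2 / size
    for size, above in zip(sample_sizes, observed)
)
approximate_probability = exp(-float(chi_square) / 2)   # upper tail, 2 d.f.
```

The numerators come out as 10, 10, 25, 25, 10 and 10, giving 90/12870 (about .007), while the chi-square value of 9 with 2 degrees of freedom has an upper-tail probability of about .011.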

g. Discussion. Mood (24) and Brown and Mood (5) have
outlined median tests for cases analogous to those encountered in
a two way analysis of variance. Exact tests are theoretically
possible for these cases, but actually impractical because of the
laborious computations involved. The user is therefore practically
forced to ignore the exact probability formulae given by Mood,
relying instead upon test statistics which have approximately the
chi-square distribution. The tests are described by the authors
referenced above and will only be briefly outlined here.

Test for main effects in a two-factor experiment with
one observation per cell: To test for column effects find the median
for each row and count the number of observations in each column
which exceed their respective row medians. Let $a_i$ be the number
of such observations in the $i$th column, let there be r rows and c
columns, and let a = c/2 if c is even or $\frac{c-1}{2}$ if c is
odd (note change in definitions of $a_i$ and a). Then each row
contains a observations exceeding the row median, and the table

a_1        a_2        ...   a_{c-1}        a_c        ra
r - a_1    r - a_2    ...   r - a_{c-1}    r - a_c    r(c-a)
r          r          ...   r              r          cr

contains ra such observations. If columns have equal effects, the
expected number of observations, in each column, which exceed
their respective row medians is ra/c and the value

$$\frac{c(c-1)}{ra(c-a)} \sum_{i=1}^{c} \left(a_i - \frac{ra}{c}\right)^2$$

is asymptotically distributed as chi-square with
c-1 degrees of freedom. Since the expected frequency, ra/c, is
simply the number of above-row-median observations in the entire
table divided by the number of columns, its use implies that there
are no interaction effects and therefore introduces this assumption
(unless one or both factors have randomly chosen levels). An
additional assumption is that all observations have distributions
which are identical except for location. Every observation within a
row must, before sampling, have had equal probability, under the null
hypothesis, of exceeding the row median. This means that an equal
proportion of each observation's population distribution must lie
above the row median. Since the row median is not fixed, but is
a variable, sample value, the above requirement is certain to be
fulfilled only if every observation within the same row has a
distribution of the same shape. In testing for row effects, a similar
argument requires that observations within the same column have
distributions of the same form. In testing for both row and column
effects, therefore, since a row observation is also a column
observation, all observations must be distributed identically except
for location. Naturally the assumptions listed under (c) must also be
made.
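The column-effects statistic for one observation per cell can be sketched as follows. The function name and the 4 x 4 table are hypothetical, invented only to exercise the formula, and the table is assumed to contain no ties with a row median.

```python
from fractions import Fraction
from statistics import median


def column_effect_statistic(table):
    """Return (a_i counts, statistic, degrees of freedom) for column effects."""
    r, c = len(table), len(table[0])
    a = c // 2                      # c/2 if c is even, (c-1)/2 if c is odd
    counts = [0] * c
    for row in table:
        row_median = median(row)
        for j, value in enumerate(row):
            if value > row_median:  # observation exceeds its own row median
                counts[j] += 1
    expected = Fraction(r * a, c)
    statistic = Fraction(c * (c - 1), r * a * (c - a)) * sum(
        (count - expected) ** 2 for count in counts
    )
    return counts, statistic, c - 1


# Hypothetical scores: r = 4 rows, c = 4 columns.
scores = [
    [1, 2, 3, 4],
    [2, 1, 4, 3],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
counts, statistic, degrees_of_freedom = column_effect_statistic(scores)
```

For these made-up scores the column counts are 0, 1, 3 and 4, the expected count is 2 per column, and the statistic is 7.5 with 3 degrees of freedom. So small a table is far from the asymptotic conditions under which the chi-square reference distribution applies; it serves only to show the arithmetic.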


Tests for various effects in a two-factor experiment with
h observations per cell: Again it is assumed that observations have
distributions which are identical except for location. A test
analogous to the "analysis of variance" test for main effects against
interaction can be made by performing the test outlined in the
preceding paragraph using cell medians as "observations". Another
useful test is the joint test for main effects and interaction. It
tests the "hypothesis that a factor has no effect whatever, either in
main effects or in interaction effects". Let $a_{ij}$ be the number of
observations, in the cell formed by the $i$th row and the $j$th
column, which exceed the median of the ch observations in the $i$th
row; and let a = ch/2, if ch is even, or $\frac{ch-1}{2}$ if ch is
odd. Then if, as hypothesized, there are no interaction or column
effects, the expected number of observations in a single row
exceeding a row median is a, and the expected number of observations
in a single cell exceeding the corresponding row median is a/c. Thus

$$\frac{c(ch-1)}{a(ch-a)} \sum_{i,j} \left(a_{ij} - \frac{a}{c}\right)^2$$

is distributed approximately as chi-square
with r(c-1) degrees of freedom. Analogous to the test of main
effects against deviations the following test can be performed if
interactions can be assumed to be zero. Let $a_i$ be the number
of the rh observations in the $i$th column which exceed their row
medians, and let $a = \frac{ch}{2}$ if ch is even, or
$\frac{ch-1}{2}$ if ch is odd.
Then there are rh observations in a column, and $\frac{ra}{c}$ of
them would be expected to exceed their respective row medians. Thus

$$\frac{c(ch-1)}{ra(ch-a)} \sum_{i} \left(a_i - \frac{ra}{c}\right)^2$$

is distributed approximately as chi-square
with c-1 degrees of freedom. Testing for interaction requires
that column and row effects be removed by subtraction of column
medians from observations followed by subtraction of row medians,
the process being continued until both columns and rows have zero
medians. Let $a_{ij}$ be the number of observations in the $ij$th cell
which exceed its median plus half the number of such observations
which equal its median.

which  equal  its  median.  Let  a.  and  a  .  be  a,,  summed  over  col- 

-J  iJ 

umns  and  rows  respectively  and  let  a  be  the  sum  over  both.  Then 


ij 


a  .  a  . 

>  J 


a.  a  .  (h-a.  a  .) 

-3  1.  -J 


is approximately distributed as chi-square with (c-1)(r-1) degrees of
freedom. The test for interactions is "very nearly but not completely
distribution-free".


A different approach to median-test analogues of analysis
of variance has been taken by Wilson (56), the technique being based
upon the fact that a "total" chi-square can be subdivided into component
chi-squares with component degrees of freedom (in a sense, the
reverse of the additive property of chi-square). In a table with r
rows and c columns, let n be the total number of observations, $n_{ij}$
be the number of observations in the cell formed by the i-th row and
the j-th column, ${}_a f_{ij}$ and ${}_b f_{ij}$ be the number of this
cell's observations which are respectively above and below the grand
median for the entire table, and $n_a$ and $n_b$ the total number of
observations above and below the grand median. Finally let a dot, in
place of a subscript, indicate summation over all values of that
subscript. Formulae for the total chi-square, $\chi^2$, and the row and
column chi-squares, $\chi^2_R$ and $\chi^2_C$, are as follows:

$$\chi^2 = \sum_{i} \sum_{j} \left[ \frac{\left( {}_a f_{ij} - n_{ij} n_a / n \right)^2}{n_{ij} n_a / n} + \frac{\left( {}_b f_{ij} - n_{ij} n_b / n \right)^2}{n_{ij} n_b / n} \right]$$

with rc-1 degrees of freedom, the expected frequencies having been
derived from "the null hypothesis that the main effects and interaction
effects produce no change in the distribution of scores",

$$\chi^2_R = \sum_{i} \left[ \frac{\left( {}_a f_{i.} - n_{i.} n_a / n \right)^2}{n_{i.} n_a / n} + \frac{\left( {}_b f_{i.} - n_{i.} n_b / n \right)^2}{n_{i.} n_b / n} \right]$$

with r-1 degrees of freedom,

$$\chi^2_C = \sum_{j} \left[ \frac{\left( {}_a f_{.j} - n_{.j} n_a / n \right)^2}{n_{.j} n_a / n} + \frac{\left( {}_b f_{.j} - n_{.j} n_b / n \right)^2}{n_{.j} n_b / n} \right]$$

with c-1 degrees of freedom, expected frequencies, in both cases,
having been obtained from "the null hypothesis that the distributions
of scores are identical for all levels of the row or column conditions".
Finally an interaction chi-square, $\chi^2_I$, is obtained by subtraction,
$\chi^2_I = \chi^2 - \chi^2_R - \chi^2_C$, with (rc-1) - (r-1) - (c-1) = (r-1)(c-1)
degrees of freedom. Computational formulae for extension of the
technique to the three-factor case have been published by Alluisi (1).
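The partition just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `wilson_chi_squares` and the nested-list data layout are hypothetical, scores equal to the grand median are simply dropped (the text does not say how such scores are to be handled), and no guard is included against empty cells or an empty "above" or "below" class.

```python
from statistics import median


def wilson_chi_squares(cells):
    """Partition the total median-test chi-square for an r x c layout.

    cells[i][j] is the list of scores in row i, column j.
    Scores equal to the grand median are dropped (an assumption of this sketch).
    """
    r, c = len(cells), len(cells[0])
    grand = median(x for row in cells for cell in row for x in cell)

    # Per-cell counts above and below the grand median.
    above = [[sum(x > grand for x in cell) for cell in row] for row in cells]
    below = [[sum(x < grand for x in cell) for cell in row] for row in cells]
    n_a = sum(map(sum, above))
    n_b = sum(map(sum, below))
    n = n_a + n_b

    def term(f_a, f_b):
        # Contribution of one cell, row, or column: observed vs. expected
        # frequencies above and below the grand median.
        m = f_a + f_b
        e_a, e_b = m * n_a / n, m * n_b / n
        return (f_a - e_a) ** 2 / e_a + (f_b - e_b) ** 2 / e_b

    total = sum(term(above[i][j], below[i][j]) for i in range(r) for j in range(c))
    rows = sum(term(sum(above[i]), sum(below[i])) for i in range(r))
    cols = sum(
        term(sum(above[i][j] for i in range(r)), sum(below[i][j] for i in range(r)))
        for j in range(c)
    )
    # Each entry is (chi-square value, degrees of freedom).
    return {
        "total": (total, r * c - 1),
        "rows": (rows, r - 1),
        "columns": (cols, c - 1),
        "interaction": (total - rows - cols, (r - 1) * (c - 1)),
    }
```

For a 2 x 2 layout in which every score in the first row lies below the grand median and every score in the second row lies above it, the whole of the total chi-square is attributed to rows and nothing remains for columns or interaction.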

It  has  been  pointed  out  that  the  test  compares  poorly  with  the 
analysis  of  variance  in  cases  where  the  assumptions  of  the  latter  test 
have  been  met  (23,  39).  Sheffield  (33)  has  objected  that  an  entirely 
equivalent  test  can  be  performed  using  analysis  of  variance  techniques. 
Frequencies  of  "above"  or  "below  median"  are  treated  as  scores  and 
their  within-cell  variance  is  known  to  be  that  of  a  binomially  distri¬ 
buted  variate  and  can  therefore  be  specified  a  priori.  With  this 
information the analysis of variance is conducted upon frequencies.
"The implications of these F tests are exactly the same as those of
Wilson's $\chi^2$ analysis. In fact, since $\chi^2$ divided by its df is
distributed the same as F for infinite df in the smaller variance, the
present F values can be transformed into Wilson's $\chi^2$ values by
multiplying F by df ..." Sheffield comments further that, whichever
approach is used, severe restrictions are imposed by confining tests to
those which can be performed using the single (within-cell) error term
which is determinable a priori and therefore distribution-free.

h. Tables. There appear to be no tables for the exact
methods, so in these cases probabilities must be computed.
Ordinary chi-square tables are used with the approximate methods.

i.  Sources.  1,  4,  5,  22,  23,  24,  28,  33,  39,  56. 


4.  Contingency  Tables 

Several  ingenious  distribution-free  tests  have  been  devised 
to  examine  the  significance  of  effects  in  an  r  x  c  table  whose  cell 
entries consist of frequencies rather than "scores" (4, 7, 8, 14, 17,
31,  36,  38,  55). 

Suppose that columns are "treatments" whose outcomes
are categorized only according to the dichotomy, "success" and
"failure", and suppose that rows are "subjects" so that data within
a row are matched, each subject receiving all treatments. If a
success is obtained on the i-th subject for the j-th treatment, a 1 is
recorded in the ij cell; if the treatment is a failure, a zero is
entered in the cell. Let $\mu_i$ be the marginal total, i.e., the number
of successes, for the i-th row, $T_j$ be the marginal total for the j-th
column, and $\bar{T}$ be the mean column sum. If the number of rows
is large, column totals will tend to have normal distributions,
and if treatments have equal effects, these distributions will have
equal variances and a common mean. As a consequence, the statistic

$$Q = \frac{c(c-1) \sum_{j} \left( T_j - \bar{T} \right)^{2}}{c \sum_{i} \mu_i - \sum_{i} \mu_i^{2}}$$

will be distributed approximately as chi-square with c-1 degrees of
freedom. This test has been proposed by Cochran (7) as a statistical
solution for the case where matching invalidates the usual chi-square
test for contingency tables. This test and certain median tests are
special cases of more general tests of dichotomized data outlined
by Blomqvist (4).
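As an illustration, Cochran's Q can be computed directly from a 0/1 table of subjects by treatments. The sketch below assumes the standard form of the statistic, with the numerator built from column totals and the denominator from row totals; the function name `cochran_q` and the list-of-rows layout are illustrative choices, not anything prescribed by the sources cited.

```python
def cochran_q(table):
    """Cochran's Q for a subjects-by-treatments table of 0/1 outcomes.

    table[i][j] is 1 if treatment j succeeded on subject i, else 0.
    Returns (Q, degrees of freedom).
    """
    c = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(c)]
    row_totals = [sum(row) for row in table]
    t_bar = sum(col_totals) / c  # mean column sum

    numerator = c * (c - 1) * sum((t - t_bar) ** 2 for t in col_totals)
    denominator = c * sum(row_totals) - sum(u * u for u in row_totals)
    return numerator / denominator, c - 1


# Four subjects, three treatments (made-up data for illustration only).
example = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
q, df = cochran_q(example)
```

Rows consisting entirely of successes or entirely of failures contribute nothing to the denominator, so a table made up only of such rows would give a zero denominator; the sketch does not guard against that case.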

Suppose that row categories represent gradations or
subcategories of a single variable, the level of the gradation or
subcategory increasing or decreasing monotonically in progressing
from the first to the last row, and suppose that a similar condition
exists for columns. A test for association in the contingency table
may be regarded then as a test for correlation between the column
variable and the row variable. If the 1st column is regarded as
having a rank of 1, the 2nd as having a rank of 2, etc., and likewise
for rows, then the frequency in the i-th row may be considered
the number of units tied for a rank of i on the row variable, and
similarly the number of units in the j-th column would be the number
tied for a rank of j on the column variable. Finally, the frequency
in the ij-th cell would be the number of units tied for a rank of i on
the row variable which are also tied for a rank of j on the column
variable. Thus the situation is analogous to that in which correlation
is to be measured between ranked variates when both rankings
contain ties. Stuart (38) proposes to calculate Kendall's
rank correlation statistic, S, in this case by multiplying each cell
frequency (a) positively by the sum of the frequencies in all cells
lying below it and to the right, (b) negatively by the sum of the
frequencies in all cells lying below it and to the left (frequencies
for cells in the same row, the same column, or above, are ignored):
the sum of (a) plus (b), taken over all cells, is S. If the
number of rows equals the number of columns, the significance of
S can then be tested by techniques taking account of ties in the
application of Kendall's test for rank correlation. Otherwise the
test can be performed using asymptotic formulae given by Stuart.
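The rule for accumulating S from an ordered contingency table translates almost word for word into code. The following is a minimal sketch; the name `kendall_s` and the list-of-rows layout are assumptions made for illustration, and the significance test itself is not attempted here.

```python
def kendall_s(table):
    """Kendall's S for an r x c table whose rows and columns are both ordered.

    Each cell frequency is multiplied by (frequencies below and to the right)
    minus (frequencies below and to the left); the products are summed.
    """
    r, c = len(table), len(table[0])
    s = 0
    for i in range(r):
        for j in range(c):
            below_right = sum(table[k][l] for k in range(i + 1, r) for l in range(j + 1, c))
            below_left = sum(table[k][l] for k in range(i + 1, r) for l in range(j))
            s += table[i][j] * (below_right - below_left)
    return s
```

A table with all its frequency on the main diagonal gives a positive S, and one with all its frequency on the opposite diagonal gives a negative S of the same size, which matches the reading of S as a measure of concordance between the two orderings.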


5.  Tests  for  a  Divergent  Population

a. Rationale. Suppose that an experimenter has a sample
from each of k continuously distributed populations with identical
forms and wishes to test the hypothesis that all populations have
the same location against the alternative hypothesis that one
population has a larger location parameter than the rest. The sample
containing the largest observation is determined, and in it the
experimenter counts the number, r, of observations which exceed
all observations in all other samples. If $n_i$ is the size of the i-th
sample and N is the total number of observations in all samples,
then there are $n_i(n_i-1)\cdots(n_i-r+1)$ or $n_i!/(n_i-r)!$ ways in which
the r largest observations could have been placed in the i-th sample
and $N(N-1)\cdots(N-r+1)$ or $N!/(N-r)!$ ways in which they could have
been located without restriction. The probability that the r largest
observations will all be in a preselected sample is therefore

$$\frac{n_i!/(n_i-r)!}{N!/(N-r)!},$$

and the probability that they will all be in some one of the k samples is

$$\Pr(r) = \frac{\sum_{i=1}^{k} n_i!/(n_i-r)!}{N!/(N-r)!}.$$

Since in the derivation it was not required that the (r+1)st largest
observation be located in a different sample, the above probability is
the probability that r or more of the largest observations will be
located in a single sample.
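The probability above is a ratio of falling factorials, so it can be evaluated exactly with integer arithmetic. The sketch below uses `math.perm` (Python 3.8 or later), which returns n!/(n-r)! and is zero when r exceeds n; the function name is an illustrative choice.

```python
from fractions import Fraction
from math import perm


def prob_r_largest_in_one_sample(sizes, r):
    """Exact probability that the r largest of all observations fall in one sample.

    sizes is the list of sample sizes n_1, ..., n_k.
    """
    total = sum(sizes)
    # Ordered placements of the r largest within a single sample, summed over
    # samples, divided by ordered placements anywhere among the N observations.
    return Fraction(sum(perm(n, r) for n in sizes), perm(total, r))
```

Returning a `Fraction` keeps the result exact; converting it with `float()` gives the decimal value usually quoted.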

b.  Null  Hypothesis.  The  probability  that  any  given  one  of
the  r  largest  observations  will  be  located  in  a  certain  sample  depends 
only  upon  r  and  the  relative  size  of  the  sample.  This  will  be  the 
case  if  all  k  sampled  populations  have  the  same  location  parameter  and 
if  all  assumptions  are  met. 

c. Assumptions. Populations are continuously and, except
for location, identically distributed. Sampling is random and
observations are independent.

d. Treatment of Ties. If the proportion of tied observations
is small, ties are a practical problem only if the smallest one
of the r largest observations is tied with an observation in a different
sample. In this case the simplest solution is to reduce the value
of r to the point at which this situation no longer exists. The
corresponding probability will be larger than the true probability for
the unreduced r, and the test will therefore be conservative.

e. Efficiency. The power of the test has been examined
by  Mosteller  (25)  with  r  =  3  for  three  samples  of  three  observations 
each  from  normally  distributed  and  from  uniformly  distributed 
populations. 


f. Application. In the following table, the four largest
observations are all in sample C.

              Sample
         A      B      C
        25     27     41
        31     35     59
        44     39     64
        51     48     70
        52     57     72

Substituting r = 4, k = 3, $n_1 = n_2 = n_3 = 5$, into the formula given
earlier,

$$\Pr(4) = \frac{5!/(5-4)! + 5!/(5-4)! + 5!/(5-4)!}{15!/(15-4)!} = \frac{1}{91}$$

or .011. This same value could have been obtained by consulting
Mosteller's (25) exact tables. The hypothesis of identical
populations is therefore rejected. Assuming identical distribution
forms, different distribution locations are indicated, and the most
reasonable presumption is that the median of population C lies above
those of populations A and B.
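The whole procedure for this example, from locating the sample that holds the largest observation to evaluating the probability, can be sketched as follows. The function name `divergent_population_test` and the dictionary layout are illustrative assumptions, and the sketch assumes there are no ties across samples.

```python
from fractions import Fraction
from math import perm


def divergent_population_test(samples):
    """Return (name of top sample, r, exact probability of r or more).

    samples maps a sample name to its list of observations; ties across
    samples are assumed absent.
    """
    # Sample containing the single largest observation.
    top = max(samples, key=lambda name: max(samples[name]))
    # Largest observation found in any other sample.
    rival_max = max(max(obs) for name, obs in samples.items() if name != top)
    # r = number of observations in the top sample exceeding all others.
    r = sum(x > rival_max for x in samples[top])

    sizes = [len(obs) for obs in samples.values()]
    p = Fraction(sum(perm(n, r) for n in sizes), perm(sum(sizes), r))
    return top, r, p


samples = {
    "A": [25, 31, 44, 51, 52],
    "B": [27, 35, 39, 48, 57],
    "C": [41, 59, 64, 70, 72],
}
name, r, p = divergent_population_test(samples)
print(name, r, p, round(float(p), 3))  # C 4 1/91 0.011
```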


g.  Discussion.  Obviously  a  test  which  uses  as  test  sta¬ 
tistic  only  the  largest  observations  must  be  extremely  sensitive  to 
both  the  shape  and  location  of  the  upper  tail  of  the  distribution  of 
the  sampled  populations.  This  should  be  borne  in  mind  when  con¬ 
ducting  the  test.  If  the  assumptions  are  not  fully  met,  the  test  may 
be merely detecting differences in contour-of-upper-tail between
distributions with identical locations.


A  number  of  authors  (17,  37,46,  47,  48,  49)  have  exam¬ 
ined  tests  for  divergent  populations.  Tukey  (48,  49)  has  tabled 
the  probability  for  the  largest  column  total,  i.  e.  rank  sum,  when 
the  ranks  from  1  to  N  are  randomly  distributed  among  k  columns. 
Both  the  size  and  presence  of  an  entry  are  randomly  distributed, 
i.  e.  ,  a  given  column  may  contain  any  number  of  ranks  from  0  to 
N.  Tsao  (46)  has  published  tables  which  can  be  used  to  obtain 
the  probability  for  the  rank  sum  of  a  predesignated  column  when 
ranks  from  1  to  c  are  substituted  for  observations  matched  across 
rows  in  a  table  with  c  columns  and  r  rows. 

h. Tables. Exact tables have been published by Mosteller
(25) for the case of equal sized samples ($n_1 = n_2 = \cdots = n_k$ = 3, 5, 7,
10, 15, 20, 25, $\infty$) with $2 \le k \le 6$ and $2 \le r \le 5$ or 6. Approximate
probability tables, appropriate when samples are of unequal size,
have been published by Mosteller and Tukey (26). Approximate
probability formulae are also given by Mosteller. A simple asymptotic
approximation is $\Pr(r) \approx 1/k^{r-1}$.

i. Sources. 17, 25, 26, 46, 47, 48, 49.


CHAPTER XIII


MISCELLANEOUS  TESTS 


The  following  chapter  presents  tests  which  do  not  appear  to 
be  readily  categorizable  within  the  topics  covered  by  the  previous 
chapters.  They  include  tests  for:  transitivity  of  preference  for 
a  single  judge,  agreement  among  several  judges,  trend  in  location, 
trend in dispersion, goodness of fit, and peripheral association.


1. Paired Comparisons: "Consistency" of a Single Judge

(Transitivity of Preference)


a. Rationale. Suppose that a judge is presented with each
of the $\binom{n}{2}$ pairs of objects which can be made with n objects and is
required to express a preference for one of the members of each
pair over its paired mate. If his preferences are transitive and
are based upon subjectively real differences, then for any three
objects, say A, B and C, if A is preferred over B and B is preferred
over C the judge must necessarily prefer A over C. Expressed
differently, if the three objects are made the vertices of
a triangle and if an arrow is placed between each pair of objects,
pointing away from the preferred member of the pair, then if
preferences are real and transitive the arrows will not all point
in the same circular, i.e. clockwise, direction as is the case in
the "inconsistent" triangle, or "circular triad".

A test for transitivity, then, can be based upon whether or not the
obtained number of circular triads is smaller than would be expected
by chance. Let the n objects be placed at the vertices of
an n-sided polygon with arrows drawn between each pair of objects,
indicating the direction of preference. There are $\binom{n}{2}$ pairs of
objects and, therefore, $\binom{n}{2}$ arrows. Each arrow can have one of two
directions. Therefore there are $2^{\binom{n}{2}}$ different patterns of
arrow-directions which can be formed by changing directions of arrows
in the polygon. The number of triads in the polygon is a constant,
$\binom{n}{3}$; however, the number of circular triads depends upon the
direction of the arrows. For each of the $2^{\binom{n}{2}}$ different patterns
of arrow-directions there will be some number of circular triads.
Therefore the probability for that number or a smaller number of
circular triads is simply the number of patterns of arrow-directions
in which that number or a smaller number of circular triads occurs,
divided by $2^{\binom{n}{2}}$.

b. Null Hypothesis. Each of the $2^{\binom{n}{2}}$ patterns of
arrow-directions was equally likely to have been the one obtained. This
will be the case if the judge actually has no real preferences in
any of the $\binom{n}{2}$ choice situations and expresses preferences purely
on a chance basis.
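For small n the chance distribution of the number of circular triads can be obtained by brute force, enumerating every pattern of arrow-directions and counting its circular triads. The sketch below is illustrative only (the function name is hypothetical); it enumerates 2 to the power n(n-1)/2 patterns, which is 32,768 for n = 6, and becomes impractical quickly as n grows.

```python
from collections import Counter
from itertools import combinations, product


def circular_triad_distribution(n):
    """Map d -> number of arrow patterns on n objects with d circular triads."""
    pairs = list(combinations(range(n), 2))
    triads = list(combinations(range(n), 3))
    counts = Counter()
    for pattern in product((False, True), repeat=len(pairs)):
        # beats[i][j] is True when object i is preferred to object j.
        beats = [[False] * n for _ in range(n)]
        for (i, j), i_wins in zip(pairs, pattern):
            beats[i][j] = i_wins
            beats[j][i] = not i_wins
        # A triad (a, b, c) is circular exactly when a-b, b-c and c-a all
        # point the same way round the triangle.
        d = sum(beats[a][b] == beats[b][c] == beats[c][a] for a, b, c in triads)
        counts[d] += 1
    return counts


counts = circular_triad_distribution(6)
total = sum(counts.values())
p_one_or_fewer = (counts[0] + counts[1]) / total
```

Dividing a cumulative count by the total number of patterns gives the lower-tail probability used by the test; for n = 6 the probability of one or fewer circular triads obtained this way can be compared with the tabled value quoted in the application.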


c. Assumptions. A preference is expressed for one of
the members of each pair of objects, i.e., there are no tied choices.
It is assumed that "trials" are randomly selected; this is necessary
to insure that the sample of the judge's behavior is representative
of his behavior in general. Random selection of judges is not
assumed since inference is confined to the judge tested. Independence
of choices is not assumed since a test for transitivity,
in a sense, tests independence rather than assuming it.

d. Treatment of Ties. Ties should be obviated by using
a  forced  choice  technique.  If  they  appear  anyhow,  the  simplest 
procedure  is  probably  to  discard  those  objects  for  which  the  great¬ 
est  number  of  ties  exist  and  to  continue  the  process  until  no  ties 
exist  among  the  remaining  objects.  The  test  may  then  be  con¬ 
ducted  upon  the  remaining  number  of  objects. 

e.  Efficiency.  No  information  available. 

f. Application. Six vintages of a certain type of wine are
to be tested as to taste. The vintages are presented to a judge
in pairs and he indicates the better tasting member of each pair.
This is done for all 15 possible pairings with the following results,
the arrow pointing away from the preferred member of each pair:

A → B, A → C, A → D, A → E, A → F, B → C, B → D,
B → E, B → F, C → D, C → E, C → F, D → E, D ← F,
E → F.

Obviously the only intransitivity is D ← F, and only
triads having DF as a side can be circular. The following polygon
therefore shows only these triads. Only one of these triads, DEF,
is circular.

[Figure: polygon with vertices A through F, showing only the triads that have DF as a side.]

Consulting Kendall's (30, 31, 32) tables it is found that, when
n = 6, the probability of two or more circular triads is .949.
Therefore the probability of one or fewer circular triads is
1 - .949 or .051. The .05 level of significance is not quite
attained, therefore, and the hypothesis that preferences are either
intransitive or determined by "chance" cannot be rejected in favor
of the alternative hypothesis of "greater-than-chance" transitivity
of preferences.

g. Discussion. Kendall apparently takes large values
of d, the number of circular triads, as his rejection region. Thus
the  test  rejects  the  hypothesis  of  either  chance  or  transitive  prefer¬ 
ences  in  favor  of  the  alternative  hypothesis  that  the  judge’s 
preferences  are  intransitive  at  a  frequency  so  large  that  it  would 
seldom  occur  by  chance.  However,  one  would  expect  this  appli¬ 
cation  to  be  somewhat  less  frequent  than  the  one  described. 

h. Tables. The exact probability for d or more circular
triads has been tabled (30, 31, 32) for cases in which $2 \le n \le 7$.
When n is larger than 7, the probability of d or more circular
triads is 1 minus the probability, read from chi-square tables, of

$$\chi^2 = \frac{2\binom{n}{3} - 8d + 4}{n-4} + \frac{n(n-1)(n-2)}{(n-4)^2} \quad \text{with} \quad \frac{n(n-1)(n-2)}{(n-4)^2}$$

degrees of freedom.
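For n larger than 7 the statistic and its degrees of freedom are easily computed before consulting chi-square tables. The sketch below assumes the approximation takes the form (2C(n,3) - 8d + 4)/(n - 4) plus the degrees of freedom n(n-1)(n-2)/(n-4)^2; the function name is hypothetical, and the tail probability itself is left to tables or to a statistics library.

```python
from math import comb


def circular_triad_chi_square(n, d):
    """Approximate chi-square and (generally fractional) degrees of freedom
    for d circular triads among n objects, intended for n > 7."""
    df = n * (n - 1) * (n - 2) / (n - 4) ** 2
    chi_square = (2 * comb(n, 3) - 8 * d + 4) / (n - 4) + df
    return chi_square, df
```

Because the statistic decreases as d increases, a small number of circular triads corresponds to a large chi-square value.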

The probability of d-1 or fewer circular triads is 1 minus
the probability for d or more circular triads. It is therefore
obtained by taking the complement of the probability given in the exact
tables, or by taking the probability of chi-square as defined above,
rather than its complement.

Counting the number of circular triads may prove difficult
when n is not small. Kendall (30) has shown that a simpler method
may be used to gain this information. An n x n table is constructed
with each of the n objects being represented by one column and one
row. If the i-th object is preferred over the j-th object, a 1 is
entered in the cell of the i-th row and the j-th column; if the reverse
is the case, a zero is entered. All cells, except those whose row
and column represent the same object, are filled in. If the row
totals are $a_1, a_2, \ldots, a_n$, then the number of circular triads, d,
is given by

$$d = \frac{n(n-1)(2n-1)}{12} - \frac{1}{2} \sum_{i=1}^{n} a_i^{2}.$$
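The row-total method can be checked against the wine-tasting example. The sketch below assumes Kendall's identity in the form d = n(n-1)(2n-1)/12 minus half the sum of the squared row totals; the function names and the list of (preferred, other) pairs are illustrative, with the pairs transcribed from the fifteen preferences listed in the application.

```python
def circular_triads_from_row_totals(row_totals):
    """Number of circular triads from the row totals of the preference table."""
    n = len(row_totals)
    # Integer form of n(n-1)(2n-1)/12 - (1/2) * sum(a_i ** 2).
    return (n * (n - 1) * (2 * n - 1) - 6 * sum(a * a for a in row_totals)) // 12


def row_totals_from_preferences(objects, preferences):
    """preferences is a list of (preferred, other) pairs, one per pairing."""
    wins = {obj: 0 for obj in objects}
    for preferred, _other in preferences:
        wins[preferred] += 1
    return [wins[obj] for obj in objects]


objects = "ABCDEF"
# Wine example: each earlier vintage is preferred to each later one,
# except that F is preferred to D.
preferences = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"),
    ("B", "C"), ("B", "D"), ("B", "E"), ("B", "F"),
    ("C", "D"), ("C", "E"), ("C", "F"),
    ("D", "E"), ("F", "D"), ("E", "F"),
]
totals = row_totals_from_preferences(objects, preferences)
d = circular_triads_from_row_totals(totals)
print(totals, d)  # [5, 4, 3, 1, 1, 1] 1
```

The result agrees with the single circular triad, DEF, found by inspection in the application.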

i.  Sources.  30,  31,  32,  40. 


2. Paired Comparisons: Agreement among m Judges


a. Rationale. Suppose that each of m judges has expressed
a preference for one of the members of a pair in each of the $\binom{n}{2}$
possible pairings of n objects and that it is desired to test whether
or not the judges tend to agree among themselves. Let $C_{ij}$ be the
number of judges choosing object i over object j. Then the number
of judges preferring j to i is $m - C_{ij}$. The $C_{ij}$ judges preferring i to
j can be paired with one another in $\binom{C_{ij}}{2}$ ways and each way
represents an agreement between two judges that the i-th object is
preferable to the j-th. Likewise there are $\binom{m - C_{ij}}{2}$ pairs of judges
preferring j to i and there are that many "agreements" that j is
preferable to i. The number of agreements as to the relative excellence
of objects i and j, irrespective of which object is the one preferred,
is therefore $\binom{C_{ij}}{2} + \binom{m - C_{ij}}{2}$. The sum,

$$\sum \left[ \binom{C_{ij}}{2} + \binom{m - C_{ij}}{2} \right],$$

taken over all $\binom{n}{2}$ pairs of values of i and j (corresponding to
pairings of objects with an object other than itself) is the total
number of agreements among the m judges in all of the $\binom{n}{2}$ pairings
of objects. This sum, represented by the symbol $\Sigma$, is the test
statistic.


Now consider a table, such as that shown below, with m
columns, corresponding to the m judges, and $\binom{n}{2}$ pairs of rows,
each pair of rows corresponding to a pairing of objects and each
row in a pair corresponding to preference for one of the two objects
over its paired mate. In each cell of the table enter a 1
if the row object was preferred over the other object in the row
pair by the column judge, otherwise enter a zero. There are two
ways a judge can assign his preference to one of the members of a
pair of objects, i.e., a 1 can be entered for either the first-listed
or the second-listed object in a pair of rows. For each of
these two ways, there are two ways in which a second judge can
assign his preference to one of the members of that pair, etc.,
so since there are m judges there are $2^m$ ways in which their
preferences can be assigned to the members of a single pair
of objects. And since there are $\binom{n}{2}$ pairs of objects, there are
$(2^m)^{\binom{n}{2}}$ or $2^{m\binom{n}{2}}$ ways in which m judges can assign their
preferences among the members of n objects judged in pairs. Thus
there are $2^{m\binom{n}{2}}$ different tables, i.e., tables with different
patterns of cell entries, which can be formed by permuting 1s and
0s within their column and pair of rows. And if each judge assigns
all his preferences randomly each of these $2^{m\binom{n}{2}}$ tables is equally
likely. To each such table there corresponds a value of $\Sigma$ and, if
preference assignments are random, the probability of this or a
larger value of $\Sigma$ is simply the number of the $2^{m\binom{n}{2}}$ tables giving
rise to this or a larger value of $\Sigma$ divided by $2^{m\binom{n}{2}}$.

b. Null Hypothesis. Each of the $2^{m\binom{n}{2}}$ tables is equally
likely to have been the table actually obtained. This will be the
case if each judge assigns his preferences randomly among the
members of each pair of objects in which case agreements between
judges will be accidental and the obtained number of such
agreements will be determined by chance. See "Discussion".

c.  Assumptions.  There  are  no  tied  choices,  i.  e.  ,  in  every 
choice  situation  one  of  the  objects  in  a  pair  must  be  preferred  over 
its  mate.  "Trials"  are  randomly  selected;  this  assumption  is 
necessary  to  insure  that  the  sample  of  the  judge's  behavior  is  repre¬ 
sentative  of  his  behavior  in  general.  Random  selection  of  judges 

is  not  assumed  since  inference  is  confined  to  the  judges  tested. 
Independence  of  choices  is  not  assumed,  rather  it  is  tested. 

d.  Treatment  of  Ties.  Ties  should  be  obviated  by  using  a 
forced  choice  technique.  If  they  appear  anyhow,  the  simplest  pro¬ 
cedure  is  probably  to  confine  the  test  to  those  judges  or  to  those 
objects  for  which  no  tied  choices  appear,  making  the  necessary 
reductions  in  m  and/or  n. 

e.  Efficiency.  No  information  available. 

f.  Application.  Suppose  that  each  of  four  judges  compares 
three  brands  of  chocolate  ice  cream  in  pairs  and  expresses  a  pre¬ 
ference  in  each  case.  It  is  desired  to  test  whether  or  not  the 
judges  tend  to  agree  among  themselves.  Let  the  data  be  shown 
below:

The value of $\Sigma$ is 14. Entering Kendall's (30, 31) tables with
m = 4, n = 3 and $\Sigma$ = 14, we find that this value of $\Sigma$ has a
probability of .043 of being equalled or exceeded. Therefore the null
hypothesis of random assignment of preferences is rejected in favor
of the alternative hypothesis that there is a nonchance degree of
agreement among judges.


The probability obtained from tables could have been computed.
There are $2^{m\binom{n}{2}}$ or $2^{12}$ possible patterns of preferences
for the table shown. Greatest agreement would occur if in each
pair of rows the 1s were all in one row, the zeros in the other.
There are two ways in which the 1s can all be in one row of a pair
of rows, and since there are three pairs of rows there are $2^3 = 8$
ways in which the greatest agreement can occur, leading to a $\Sigma$ of
18. The next greatest amount of agreement occurs when in two
pairs of rows the 1s are all in a single row of the pair and in the
third pair of rows three 1s are in one row, the remaining 1 in the
other row. There are $2^2$ ways in which for both of two given pairs
of rows all the 1s in a pair can be in one row. In the remaining
pair of rows either row can be selected to contain the single 1,
and the 1 can occur in any of its four cells, making eight ways
of obtaining three 1s in one row and one 1 in the other row of a
given pair. Finally, the pair of rows one of which contains three
1s, the other a single 1, could occur for any one of the three pairs
of objects. Therefore there are $(2^2)(8)(3) = 96$ ways in which the
next greatest $\Sigma$, 15, could be obtained. The next greatest amount
of agreement is for the obtained case, where in two pairs of rows
the 1s are all in one row of the pair, and in the remaining pair of
rows each row contains two 1s. This case differs from the preceding
one only in the number of ways of assigning 1s in the remaining
pair of rows; there are $\binom{4}{2} = 6$ ways of placing two 1s in
two of the four cells of a row. So for the last case there are
$(2^2)(6)(3) = 72$ ways of obtaining a $\Sigma$ of 14. The probability for
a $\Sigma$ of 14 or greater is therefore

$$\frac{8 + 96 + 72}{2^{12}} = .043.$$

A somewhat simpler tabulational procedure than that
given above is to form an n x n table with rows and columns both
representing objects. The cell entry, $r_{ij}$, in the i-th row and j-th
column is the number of judges who prefer i when it is compared
with j. The value of $\Sigma$ is found by summing $\binom{r_{ij}}{2}$ over all n(n-1)
cells in the table corresponding to preferences for the row object
over a different column object. For the data just given the table
would be:


ABC 
A 
B 
C 


and $\Sigma$ would be $\binom{4}{2} + \binom{4}{2} + \binom{2}{2} + \binom{2}{2} = 6 + 6 + 1 + 1$
= 14 as before.

g. Discussion. Strictly speaking, the null hypothesis is
that all preferences are assigned randomly, since the use of an
unweighted $2^{m\binom{n}{2}}$ as the denominator of a probability fraction
implies that this is so, i.e., since the tables for Σ are based upon
its chance distribution. Preferences, of course, can be assigned
quite systematically without there being any substantial measure
of agreement in the group of judges as a whole. For example, half
the judges may always prefer the "alphabetically higher" object of
a pair and half may always prefer the "alphabetically lower". In
cases such as this one, where there are systematic but opposing
biases among judges, the null hypothesis as stated is false, but
Σ will not assume an extreme value calling for its rejection.
The null hypothesis is likely to be rejected if there are systematic
but unopposing biases, but this condition amounts to "agreement"
among judges. Therefore, since rejection of the null hypothesis can
only be caused by chance (to the degree implied by the significance
level used) or by agreement among judges, it can be regarded, as
a practical matter, as stating simply that there is no nonchance
degree of agreement among judges.

It is to be noted that agreement among judges does not
imply transitivity of preference. For example, in paired comparisons
of three objects, there might be complete agreement in that
all judges prefer A to B, B to C and C to A, which set of preferences
forms a circular, or "inconsistent", triad. Nor does transitivity
for each judge imply agreement among judges. When either
agreement or transitivity is lacking, it would not be legitimate to
rank the n objects from best to worst on the basis of preferences
expressed in paired comparisons.

A  test  for  agreement  among  judges  is  useful  when  the 
thing  being  measured  is  of  a  strictly  subjective  nature,  such  as 
the  relative  deliciousness  of  a  variety  of  flavors.  The  paired  com¬ 
parison  technique  is  useful  when  the  n  things  being  compared 
differ  along  so  many  dimensions  or  in  such  a  complex  way  that 
they  cannot  properly  be  ranked  from  best  to  worst.  The  tech¬ 
nique  is  also  useful  when  judgments  are  strongly  affected  by  such 
sequential  factors  as  the  immediately  preceding  trial,  the  number 
of  preceding  trials  and  the  interval  between  trials.  This  type  of 
situation  arises,  for  example,  in  taste  testing  where  the  sensitivity 
of  the  taste  buds  depends  upon  the  nature,  number  and  duration  of 
the  preceding  stimuli  and  upon  the  interval  between  the  present 
and the preceding stimulus. In order to compare properly two
taste stimuli, they must not be separated by intervening stimuli.
The method of paired comparisons, therefore, is generally used.

The test described, originating with Kendall and Smith
(32), appears to be the simplest and easiest to apply. However,
exact tests for paired comparisons have also been devised and
tabled by Bradley and his colleagues (1, 5, 6, 7, 8, 9, 58). A
test  somewhat  analogous  to  that  of  Kendall  and  Smith  has  been 
outlined  and  tabled  by  Cartwright  (10).  However  it  is  not  con¬ 
nected  with  the  method  of  paired  comparisons.  Instead  it  tests 
multijudge  reliability  when  each  of  m  judges  assigns  each  of  n 
objects  to  one  of  K  categories. 


h. Tables. Exact probabilities have been tabled (30, 31,
32) for Σ for the cases m = 3, 2 ≤ n ≤ 8; m = 4, 2 ≤ n ≤ 6; m = 5,
2 ≤ n ≤ 5 and m = 6, 2 ≤ n ≤ 4. When m or n exceeds these values,
approximate probabilities for Σ may be obtained by referring

$$\frac{4}{m-2}\left[\Sigma - \binom{n}{2}\binom{m}{2}\frac{m-3}{2(m-2)}\right],$$

with $\binom{n}{2}\frac{m(m-1)}{(m-2)^2}$ degrees
of freedom, to the probability tables for chi-square. A correction
for continuity may be made by subtracting 1 from Σ.
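
As a rough illustration of the arithmetic, the sketch below evaluates a chi-square approximation of the form chi2 = 4/(m-2) * [Σ - C(n,2) C(m,2) (m-3) / (2(m-2))] with C(n,2) m(m-1)/(m-2)^2 degrees of freedom. That form is an assumption reconstructed from the damaged typeset formula and should be checked against sources (30, 31, 32) before use; the function name is hypothetical. It is applied to the earlier example (m = 4, n = 3, Σ = 14) purely for illustration, since that case is covered by the exact tables.

```python
from math import comb

def sigma_chi_square(sigma, m, n, continuity=True):
    """Chi-square deviate and degrees of freedom for the agreement score.

    Assumed form (reconstructed, not verified against the cited tables):
      chi2 = 4/(m-2) * (sigma - C(n,2)*C(m,2)*(m-3) / (2*(m-2)))
      df   = C(n,2) * m*(m-1) / (m-2)**2
    """
    if m <= 2:
        raise ValueError("the approximation requires more than two judges")
    if continuity:
        sigma = sigma - 1              # continuity correction: subtract 1
    centre = comb(n, 2) * comb(m, 2) * (m - 3) / (2.0 * (m - 2))
    chi2 = 4.0 / (m - 2) * (sigma - centre)
    df = comb(n, 2) * m * (m - 1) / (m - 2) ** 2
    return chi2, df

chi2, df = sigma_chi_square(sigma=14, m=4, n=3)
print(chi2, df)    # 17.0 9.0
```

A chi-square of 17.0 on 9 degrees of freedom lies just beyond the tabled .05 point, which is in reasonable agreement with the exact probability of .043 found earlier.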


i. Sources. 4, 23, 30, 31, 32, 40. (See also 1, 5, 6,
7, 8, 9, 19, 27, 58.)


3. The Difference-Sign Test for Trend

a. Rationale. Suppose that N observations have been
made in sequence upon a continuously distributed variable and it is
desired to test whether or not the variable's fluctuations contain a
temporal trend. Let each observation (except the first) be subtracted
from the observation immediately preceding it, and record
only whether the difference is positive or negative. The number of
algebraic signs of one kind recorded for the N-1 subtractions is the
test statistic. If there is a trend, signs of one kind should
predominate. Suppose that the N observations were ranked in order of
size from 1 to N. If there were no trend, then each of the N!
permutations of the integers from 1 to N would be equally likely to be
the obtained sequence of ranked observations. Therefore, in the
absence of trend, the probability of obtaining m or more minus
difference-signs is simply the number of the N! permutations of
integers from 1 to N which yield m or more minus differences, when
each integer is subtracted from the one preceding it, divided by N!.

b. Null Hypothesis. Each of the N! permutations of the
N observations is equally likely to have been the sequence obtained.
If a monotonic trend exists, this will not be the case.

c. Assumptions. The sampled population is continuously
distributed, i.e., there are no tied observations. Sampling is random
in the sense that the moment at which an observation is taken
is selected without knowledge as to the magnitude the observation
will have at that moment.

d.  Treatment  of  Ties.  A  small  number  of  tied  observa¬ 
tions  are  a  practical  problem  only  when  they  are  adjacent  in  sequence. 
In  this  case,  for  a  conservative  test,  give  all  zero  differences  the 
sign  least  conducive  to  rejection  of  the  null  hypothesis.  To  mini¬ 
mize  tie  error  in  the  long  run,  arbitrarily  give  half  the  zero  differ¬ 
ences  a  plus  sign,  half  a  minus  sign. 

e. Efficiency. Against normal regression alternatives,
the difference-sign test has an asymptotic relative efficiency of
zero with respect to the regression coefficient test, as well as
with respect to a half-dozen distribution-free tests (55). See
Table I in the Introduction. It is superior in efficiency to the
turning points test. An A.R.E. of zero does not, of course,
mean that the test is useless. (See Introduction.)

The test has been found to be consistent, and its power
has been investigated, in the case of normal regression alternatives
(20, 56).

f. Application. Seven observations are taken in sequence
and are as follows: 95, 88, 86, 81, 84, 77, 72. Starting with the
second observation and subtracting each observation from the
preceding one we have the following sequence of difference-signs:
+, +, +, -, +, +. Entering Moore and Wallis' tables with N = 7
and m = 1, we obtain .048 as the probability of 1 or fewer differences
of like sign. Therefore, the null hypothesis of no trend can
be rejected at the two-tailed .048 level, or, if the null hypothesis
was that there is either no trend or an upward trend, it could be
rejected at the one-tailed .024 level of significance.
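
The tabled probability can be reproduced by brute force for a sample this small. The sketch below is an illustration, not the method used to build Moore and Wallis' tables: it enumerates all N! orderings and tallies them by the number of minus difference-signs. The function name `minus_sign_counts` is an assumption of this example.

```python
from itertools import permutations
from math import factorial

def minus_sign_counts(n):
    """Tally the permutations of 1..n by number of minus difference-signs.

    Each value is subtracted from the one preceding it, so a minus
    sign arises wherever the sequence rises.
    """
    counts = [0] * n                       # possible counts are 0 .. n-1
    for perm in permutations(range(1, n + 1)):
        minus = sum(1 for a, b in zip(perm, perm[1:]) if a - b < 0)
        counts[minus] += 1
    return counts

obs = [95, 88, 86, 81, 84, 77, 72]
m = sum(1 for a, b in zip(obs, obs[1:]) if a - b < 0)    # observed minus signs
counts = minus_sign_counts(len(obs))
one_tailed = sum(counts[: m + 1]) / factorial(len(obs))
two_tailed = 2 * one_tailed            # the null distribution is symmetric
print(m, round(one_tailed, 3), round(two_tailed, 3))     # 1 0.024 0.048
```

Enumeration grows as N!, so this approach is practical only for roughly the range of N covered by the exact tables.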

g.  Discussion.  Moore  and  Wallis  (39)  and  Stuart  (56) 
have  also  considered  tests  for  correlation  between  two  series  of  ob¬ 
servations.  Stuart  aligns  the  two  sequences  of  difference-signs,  one 
below  the  other,  and  takes  as  his  test  statistic  the  number  of  columns 
containing  like  difference  signs.  Moore  and  Wallis  tabulate  the  fre¬ 
quency  of  occurrence  of  each  of  the  four  possible  combinations 

of  sign  among  the  two  entries  in  a  column  and  analyze  by  means  of 
a  fourfold  table.  Unfortunately  these  tests  appear  to  be  strictly 
legitimate  only  if  neither  series  contains  a  real  trend,  in  which 
case the true correlation would be zero. (See 39, page 161.) For
large  samples  they  may  be  useful  as  approximate  tests. 

h.  Tables.  Exact  two-tailed  probabilities  for  the  number 
of  difference-signs  of  one  sign  have  been  tabled  by  Moore  and  Wallis 
(39)  for  all  values  of  N  between  2  and  11. 


For larger values of N, the number, m, of minus difference-signs
is approximately normally distributed with mean (N-1)/2
and variance (N+1)/12. Therefore approximate probabilities
may be obtained by entering the normal tables with

$$Z = \frac{m - \frac{N-1}{2}}{\sqrt{\frac{N+1}{12}}}.$$

A correction for continuity can be introduced by reducing the
absolute value of the numerator by 1/2. The
probability obtained will be one-tailed unless the tables give
two-tailed probabilities.
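
A minimal sketch of this large-sample approximation follows, using the mean (N-1)/2 and variance (N+1)/12 stated above and assuming a continuity correction of 1/2. It is applied to the seven-observation example only to show the mechanics; with N this small the exact tables should be preferred. The function name is illustrative.

```python
from math import copysign, sqrt
from statistics import NormalDist

def difference_sign_z(m, n, continuity=True):
    """Normal deviate for m minus difference-signs among n observations."""
    numerator = m - (n - 1) / 2
    if continuity:
        # shrink the numerator toward zero by 1/2 (assumed correction size)
        numerator = copysign(max(abs(numerator) - 0.5, 0.0), numerator)
    return numerator / sqrt((n + 1) / 12)

z = difference_sign_z(m=1, n=7)
p_lower = NormalDist().cdf(z)      # one-tailed probability of so few minus signs
print(round(z, 3), round(p_lower, 3))
```

For these data the approximate one-tailed probability comes out somewhat larger than the exact .024, as might be expected with so few observations.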


i. Sources. 20, 36, 39, 54, 55, 56.


4. Records Tests for Trend in Location or Dispersion


a. Rationale. Let the r-th observation in a sequence of n
observations be called an upper record if it is larger, and a lower
record if it is smaller, than all of the r-1 preceding observations.
(By definition the first observation is not a record value.) If there
is no trend in the sampled variable, then each of the n! permutations
of the n observations was equally likely to have been the sequence
obtained, and any statistic based upon records should have a chance value.
On the other hand, if there is a monotonic upward (downward) trend in
location, then each observation has a greater-than-chance likelihood
of being an upper (lower) record and a smaller-than-chance likelihood
of being a lower (upper) record, and the difference, d, defined as the
number of upper records minus the number of lower records, should tend
to assume extreme positive (negative) values. Likewise if there is a
monotonic trend toward increasing (decreasing) dispersion, then each
observation has a greater (smaller) than chance likelihood of being a
record of either type, and the sum, s, defined as the number of upper
records plus the number of lower records, should tend to assume an
extremely large (small) value. The probability for a given value of d,
or of s, is simply the proportion of the n! permutations of the integers
from 1 to n which yield that value of the statistic.

b.  Null  Hypothesis.  Each  of  the  n!  possible  permutations 
of  the  n  untied  observations  was  equally  likely  to  have  been  the  se¬ 
quence  obtained  in  the  sample. 

c.  Assumptions.  The  sampled  population  is  continuously 
distributed,  i.  e.  ,  there  are  no  tied  observations.  Sampling  is  ran¬ 
dom  and  independent. 

d. Treatment of Ties. The authors recommend that ties
be broken randomly, i.e., that one should "rank the tied observations
according to a random permutation of their serial order." However,
for a conservative test resolve ties in the manner least conducive to
rejection of the null hypothesis.

e. Efficiency. As a test for randomness against normal
regression alternatives, the d test has an asymptotic relative
efficiency of zero with respect to the best parametric test based on the
regression coefficient and with respect to some half-dozen
distribution-free tests. It is more efficient than either the
difference-sign test or the turning points test, both of which have
zero A.R.E. with respect to the d test (55). See Table I in Introduction.

The power of both the d test and the D test (see Discussion),
at the .05 level, against the alternative that the sampled variable is
normally distributed with constant variance but with a positive linear
trend in the mean has been tabulated by Foster and Stuart (20) for
various sample sizes and degrees of trend. These power functions
were obtained empirically by means of a large sampling experiment.

The d test is consistent against the alternative that the form
of the sampled population's distribution remains constant while a
location parameter increases by equal increments along the sequence.
The s test is consistent against an analogous alternative involving
trend in dispersion only (20). The authors believe these consistency
properties to apply also to the round-trip tests (see Discussion).

f. Application. A rat makes the following sequence of time
scores in running a very difficult maze: 460, 457, 459, 455, 453,
451, and it is desired to test whether or not the rat is learning. There
are four lower and no upper records. Entering Foster and Stuart's
(20) tables, the probability that d = -3 or a larger value is found to
be .985, so the probability that d = -4 or less is .015 and the
hypothesis of no learning is rejected.

Had the rat's scores been 455, 457, 456, 453, 450, 465, 447,
463, 475, 444, 449 there would have been three upper and four lower
records. The hypothesis that the rat's variability was increasing with
time (possibly indicating the testing and rejection of false hypotheses
by the rat) can be tested by entering Foster and Stuart's tables with
n = 11 and s = 7. The probability that s does not exceed 6 is found
to be .964, so the probability of an s of 7 or greater is .036 and the
hypothesis of constant variability is rejected in favor of the hypothesis
that it is increasing.
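
Both tabled probabilities can be recomputed exactly. The sketch below counts records and builds the null distributions of d and s position by position; it relies on an assumption not stated in the report, namely the standard result that under randomness the r-th observation is an upper record with probability 1/r and a lower record with probability 1/r, independently of the other positions. Function names are illustrative.

```python
from fractions import Fraction

def record_counts(seq):
    """Return (upper, lower) record counts; the first value is not a record."""
    upper = lower = 0
    hi = lo = seq[0]
    for x in seq[1:]:
        if x > hi:
            upper, hi = upper + 1, x
        elif x < lo:
            lower, lo = lower + 1, x
    return upper, lower

def null_distributions(n):
    """Exact null distributions of d and s as {value: Fraction}.

    Assumes independent record indicators with probability 1/r each
    for an upper and for a lower record at position r (r >= 2).
    """
    d_dist, s_dist = {0: Fraction(1)}, {0: Fraction(1)}
    for r in range(2, n + 1):
        p = Fraction(1, r)
        new_d, new_s = {}, {}
        for d, w in d_dist.items():
            for step, q in ((1, p), (-1, p), (0, 1 - 2 * p)):
                new_d[d + step] = new_d.get(d + step, 0) + w * q
        for s, w in s_dist.items():
            for step, q in ((1, 2 * p), (0, 1 - 2 * p)):
                new_s[s + step] = new_s.get(s + step, 0) + w * q
        d_dist, s_dist = new_d, new_s
    return d_dist, s_dist

maze = [460, 457, 459, 455, 453, 451]
up, low = record_counts(maze)
d_dist, _ = null_distributions(len(maze))
p_d = sum(w for d, w in d_dist.items() if d <= up - low)
print(up - low, p_d, round(float(p_d), 3))      # -4 11/720 0.015

scores = [455, 457, 456, 453, 450, 465, 447, 463, 475, 444, 449]
up, low = record_counts(scores)
_, s_dist = null_distributions(len(scores))
p_s = sum(w for s, w in s_dist.items() if s >= up + low)
print(up + low, round(float(p_s), 3))           # 7 0.036
```

Working with `Fraction` keeps the probabilities exact, so the results can be compared digit for digit with published tables.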

g. Discussion. The statistics d and s are asymptotically
independent. Therefore when n is quite large a general test of the
null hypothesis of randomness against alternatives of nonrandomness
can be made by combining probabilities for d and for s, using the
conventional methods for combining independent probabilities.

The number of upper records, when proceeding from the
first to the last observation, is not necessarily the number of upper,
nor necessarily the number of lower, records when proceeding in
the opposite direction, and, in fact, is unlikely to be so. Therefore
additional "information" is contained in the "round-trip"
statistic D = d - d', where d' is analogous to d but counted by
proceeding from the last observation to the first. No exact small
sample tables for D are available; however, when n is large, D
is approximately normally distributed with mean of zero, so
$D/\sigma_D$ may be treated as a normal deviate and probabilities obtained
from normal tables. Unfortunately $\sigma_D$ is not easy to obtain;
however, a few approximate values have been tabled: Table 4 of (20)
gives empirical values of $\sigma_D$ corresponding to n's of 10, 25, 50, 75,
100, and 125 based upon a large sampling experiment. The D test
was found to be considerably more powerful than the d test on the
basis of a sampling experiment conducted by its authors.

h. Tables. Tables have been published (20) which give
the exact probability that d does not exceed given values when
3 ≤ n ≤ 6. Other tables (20) give the exact probability that s does
not exceed given values for 3 ≤ n ≤ 15.

When these tables do not apply, approximate tests may
be performed by taking $(d - \bar{d})/\sigma_d$ and $(s - \bar{s})/\sigma_s$
as normal deviates and
obtaining approximate probabilities by referring them to normal
tables. The value of $\bar{d}$ is zero. The values of $\bar{s}$, and the
standard errors, $\sigma_s$ and $\sigma_d$, are given in Table 3 of (20) for values of
n from 10 to 100 in steps of 5.
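
When Table 3 of (20) is not at hand, the required moments can be sketched from first principles. The code below is an illustration under the assumption (not stated in the report) that record indicators at different positions are independent, with probability 1/r each for an upper and for a lower record at position r; the resulting values should be checked against the published table. Function names are hypothetical.

```python
from math import sqrt

def record_moments(n):
    """Null mean of s and standard errors of s and d (derived, not tabled)."""
    positions = range(2, n + 1)
    mean_s = sum(2 / r for r in positions)
    var_s = sum((2 / r) * (1 - 2 / r) for r in positions)
    var_d = sum(2 / r for r in positions)      # the mean of d is zero
    return mean_s, sqrt(var_s), sqrt(var_d)

def record_deviates(d, s, n):
    """Approximate normal deviates for observed d and s."""
    mean_s, sd_s, sd_d = record_moments(n)
    return d / sd_d, (s - mean_s) / sd_s

mean_s, sd_s, sd_d = record_moments(10)
print(round(mean_s, 3), round(sd_s, 3), round(sd_d, 3))
```

The two deviates returned by `record_deviates` are then referred to normal tables as described above.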


i. Sources. 11, 20, 55.


5. The $S_1$ Sign Test for Trend


a. Rationale. A number of tests for trend use as the
test statistic the number of difference-signs of one type resulting
from a series of subtractions of subsequent from earlier observations.
However, these tests are not equally efficient. If a
real monotonic trend exists, then the farther apart two observations
are in the sequence the greater their difference in size is
likely to be and the greater is the likelihood that the sign of their
difference will correspond to the direction of the trend. Therefore,
Cox and Stuart subtract the Nth observation from the first,
the (N-1)st from the 2nd, the (N-2)nd from the 3rd, etc., and
weight each difference-sign by the distance between the observations
giving rise to it. Thus if $h_{ij}$ is defined to be 1 when
the i-th observation is greater than the j-th, i.e., if their
difference-sign is plus, and to be zero when the reverse is the case,
then Cox and Stuart's test statistic is

$$S_1 = \sum_{k=1}^{N/2} (N - 2k + 1)\, h_{k,\,N-k+1}.$$

This statistic is asymptotically normally distributed with mean
$N^2/8$ and variance $N(N^2-1)/24$, thus providing a large-sample,
approximate test of significance. N must always be made an
even number. When there is an odd number of observations,
the middle observation is dropped.


b. Null Hypothesis. Each of the N! permutations of the
N observations was equally likely to have been the sequence obtained.

c. Assumptions. The sampled population is continuously
distributed,  i.  e.  ,  there  are  no  tied  observations.  Sampling  is  ran¬ 
dom  and  independent. 

d. Treatment of Ties. A small number of ties does not
create a practical problem unless a k-th observation is tied with an
(N - k + 1)st observation. In this event, resolve ties in the manner
least conducive to rejection of the null hypothesis, for a conservative
test; or, to minimize error in the long run, give h a value of
1/2, i.e., halfway between a zero, indicating a minus, and a 1,
indicating a plus.

e. Efficiency. As a test for randomness against normal
regression alternatives, the $S_1$ sign test for trend in location has an
asymptotic relative efficiency of .86 relative to the best parametric
test based on the regression coefficient and has an A.R.E.
of .87 relative to Kendall's rank correlation test, i.e., Mann's
T test. It is more efficient than a number of other distribution-free
tests for trend (12, 55). See Table I in Introduction.

f. Application. Let the observations be 50, 51, 52,
34, 54, 56, 55, 51, 20, 47, 42, 43, 44, 41, 28, 35, 39, 36, 30,
31, 29, 23, 25, 18, 21. There are 25 observations, so the middle
observation, 44, is dropped, leaving N = 24. The differences
are (50-21), (51-18), (52-25), (34-23), (54-29), (56-31), (55-30),
(51-36), (20-39), (47-35), (42-28), (43-41), and the corresponding
values of h are 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1. The corresponding
weights of h, followed in parentheses by the value of h, are
23(1), 21(1), 19(1), 17(1), 15(1), 13(1), 11(1), 9(1), 7(0), 5(1),
3(1), 1(1). Thus

$$S_1 = \sum_{k=1}^{N/2} (N - 2k + 1)\, h_{k,\,N-k+1} = 137.$$

If the null hypothesis is true, this value has a mean of approximately
$N^2/8 = 24^2/8 = 72$ and a variance of
$N(N^2-1)/24 = 24(24^2-1)/24 = 575$, and

$$\frac{137 - 72}{\sqrt{575}} = 2.71$$

is approximately a normal deviate. Entering the normal tables with this
value we find that the obtained value of $S_1$ is significant at the
two-tailed .01 level (or at the one-tailed .005 level). In view of the
small value of N used, these probabilities should be regarded as
very approximate.
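
The computation just worked by hand can be sketched in a few lines. This is an illustration assuming the weight N - 2k + 1 for the pair formed by the k-th and (N - k + 1)st observations and a value of 1/2 for tied pairs; the function name `s1_sign_test` is not from the original.

```python
from math import sqrt

def s1_sign_test(obs):
    """Return (S1, Z) for the weighted sign statistic on a sequence."""
    data = list(obs)
    if len(data) % 2 == 1:
        del data[len(data) // 2]          # drop the middle observation
    n = len(data)
    s1 = 0.0
    for k in range(1, n // 2 + 1):
        first, last = data[k - 1], data[n - k]
        weight = n - 2 * k + 1            # distance between the paired positions
        if first > last:
            h = 1.0
        elif first < last:
            h = 0.0
        else:
            h = 0.5                       # tie: halfway between minus and plus
        s1 += weight * h
    z = (s1 - n ** 2 / 8) / sqrt(n * (n ** 2 - 1) / 24)
    return s1, z

obs = [50, 51, 52, 34, 54, 56, 55, 51, 20, 47, 42, 43, 44,
       41, 28, 35, 39, 36, 30, 31, 29, 23, 25, 18, 21]
s1, z = s1_sign_test(obs)
print(s1, round(z, 2))    # 137.0 2.71
```

The deviate is then referred to normal tables, with the same caution about small N noted in the text.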

g. Discussion. In addition to its use as a test for trend
in location, the outlined technique, slightly modified, can also be
used to test for trend in dispersion. (In which case it has A.R.E.
of .74, under "parametric" conditions, relative to the maximum
likelihood test.) The sequence of N observations is divided into
r blocks, each block containing the same number, k, of consecutive
observations, the N - rk "extra" observations being randomly
selected and discarded. The range, w, is determined for each
block. The sequence of ranges of consecutive blocks is then treated
as a sequence of r "observations" and tested for trend in location by
means of the sign test already outlined. A monotonic trend in
"location" of ranges is equivalent to a monotonic trend in the
dispersion of the observations upon which the ranges are based.

h. Tables. There appear to be no tables of exact probabilities,
so the test should not be used when N is small. As N
approaches infinity, the distribution of

$$\frac{S_1 - \frac{N^2}{8}}{\sqrt{\frac{N(N^2-1)}{24}}}$$

approaches the normal distribution whose mean is zero and whose
variance is unity. Therefore tables of the normal distribution can be
used to obtain approximate probabilities when N is moderately large.

i.  Sources.  12,  55. 


6.  David's  Combinatorial  Tests  of  Fit 


a. Rationale. Suppose that an experimenter has a sample
of N observations and that he wishes to test whether or not the sample
came from a hypothesized population whose distribution he can
specify completely. Since the population distribution, under the
null hypothesis, is known, it can be divided into N nonoverlapping
vertical strips, each of which contains the same area, 1/N.
To these N vertical strips there will be N corresponding ranges of
abscissa values, each range having equal probability, 1/N, of
"containing" an observation drawn randomly from the population. Now
let N observations be drawn and let Z be the number of ranges
containing zero observations, i.e., no observations. If the sample
was drawn from the hypothesized population, then Z will have a
"chance" value; if the sample was actually drawn from some other
population, then Z will tend to assume large values with
greater-than-chance probability. The probability of a given value of Z,
when the null hypothesis is true, is simply the number of ways
N balls can be dropped into N boxes or compartments so as to
leave an unspecified Z compartments empty, divided by the number
of ways N balls can be dropped into N boxes without restriction.
These probabilities have been tabled by David (16).

The  above  test  is  a  test  of  fit  against  general  alternatives. 
However,  if  the  experimenter  suspects  certain  alternatives  to  be 
more likely than others, he may wish to specify the general location
of the "empty compartments", i.e., ranges containing no sample
observation.  For  example,  if  the  true  population  is  believed  to  have 
the  same  form  as  the  hypothesized  population  but  a  larger  median, 
then  one  would  expect  more  empty  compartments  below  the  median 
of  the  hypothesized  population  than  above  it.  Likewise,  if  the  true 
and  hypothesized  populations  are  symmetrical  and  have  equal  means 
but  different  variances,  the  variance  of  the  true  population  being  the 
larger,  then  one  would  expect  more  empty  compartments  in  the  mid¬ 
dle  than  at  the  extremes  of  the  hypothesized  distribution.  David 
(16),  therefore,  has  proposed  a  second  test  in  which  the  hypothesized 
distribution  is  divided  into  2N  nonoverlapping  vertical  strips  of  equal 
area, of which N are selected to be the "test" compartments. A
sample  of  N  observations  is  then  drawn  and  the  number,  Z,  of  empty 
compartments  among  the  predesignated  N  test  compartments  is 
counted.  Probabilities  for  Z  in  this  second  test  have  also  been 
tabled  by  its  author. 

b.  Null  Hypothesis.  Each  of  the  N  sample  observations 
was  equally  likely  to  have  been  drawn  from  each  of  the  N  (or  in  the 
case  of  the  second  test,  2N)  ranges  of  abscissa  values  correspond¬ 
ing  to  equal  areas  of  the  hypothesized  distribution.  This  will  be 
the  case  if  the  hypothesized  distribution  is  the  distribution  sampled 
and if all assumptions are met.

c. Assumptions. Sampling is random and independent
and there is zero probability that an observation will be tied for
inclusion in adjacent abscissa ranges, i.e., that it will fall at the
common endpoint of two abscissa ranges.

d.  Treatment  of  Ties.  If  the  hypothesized  distribution 
is  discontinuous  and  an  endpoint  of  an  abscissa  range  happens  to 
coincide  with  one  of  the  discrete  population  values,  the  test  had  best 
be  avoided.  Ties  due  to  this  cause  could,  of  course,  be  broken  and 
assigned  among  the  two  adjacent  ranges  in  the  same  proportion  as 
would  be  required  to  break  the  relative  frequency  of  the  discrete 
value  in  order  to  maintain  equal  areas  in  the  hypothesized  distri¬ 
bution. 

If  the  hypothesized  distribution  is  continuous,  ties  may 
be  broken  by  assigning  them  in  the  manner  least  conducive  to  re¬ 
jection  of  the  null  hypothesis,  by  assigning  half  of  each  group  of 
ties  to  each  of  the  two  ranges  tied  for,  or  by  breaking  them  random¬ 
ly.  See  Introduction. 

e. Efficiency. Formulae by which to obtain power functions
are given (16) for both tests by their author, and certain power
comparisons are made. The power of the first test and the power
of chi-square to reject the hypothesis that the population is normally
distributed with zero mean and unit standard deviation were obtained
(using N = 30 and α = .05) when the distribution and mean are as
hypothesized but the standard deviation is 4/3. The ratio of the
power of the zeros test to that of chi-square was .968.

f. Application. It is hypothesized that a certain population
is normally distributed with a mean of 500 and a standard deviation
of 10. In order to test this hypothesis, a sample of six observations
is drawn from the population in question, their values being:
457, 462, 489, 515, 538, 564. From normal tables we find that
1/3 of the area of a normal curve is between ±.4307σ of the mean and
2/3 of the area, between ±.9674σ of the mean. Therefore, since it
is symmetrical, the normal curve is divided into six equal and
nonoverlapping areas by the points μ - .9674σ, μ - .4307σ, μ,
μ + .4307σ, and μ + .9674σ. Substituting 500 for μ and 10 for σ,
these points become: 490.326, 495.693, 500.000, 504.307, and
509.674, and the six ranges or "compartments" are -∞ to 490.326,
490.326 to 495.693, 495.693 to 500.000, 500.000 to 504.307,
504.307 to 509.674, and 509.674 to +∞. The six sample observations
all fall into two of the six ranges, leaving four compartments
empty. Entering David's tables with N = 6 and Z = 4, the probability
of four or more empty compartments is found to be .0200
and the null hypothesis is therefore rejected at better than the .05
level of significance.
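
The division into equal-area compartments and the count of empty ones can be sketched with the Python standard library in place of printed normal tables. The function name `empty_compartments` is illustrative; an observation falling exactly on a cut point is assigned here to the compartment on its right, which is an arbitrary choice of this sketch and not a recommendation of the report.

```python
from bisect import bisect_right
from statistics import NormalDist

def empty_compartments(sample, dist, cells):
    """Count equal-probability compartments of `dist` holding no observation."""
    # interior cut points at cumulative probabilities 1/cells, 2/cells, ...
    cuts = [dist.inv_cdf(k / cells) for k in range(1, cells)]
    occupied = {bisect_right(cuts, x) for x in sample}
    return cuts, cells - len(occupied)

sample = [457, 462, 489, 515, 538, 564]
cuts, z = empty_compartments(sample, NormalDist(mu=500, sigma=10),
                             cells=len(sample))
print([round(c, 3) for c in cuts], z)
```

The same function serves for any completely specified distribution object that offers an `inv_cdf` method.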


This probability could have been computed. The four
empty compartments could have been selected in $\binom{6}{4}$ or 15 ways.
The six observations can occupy the remaining two boxes in the
following ways, the denominator of the multinomial expression
in each case indicating the split of the six observations between
the two compartments:

$$\frac{6!}{1!\,5!} + \frac{6!}{2!\,4!} + \frac{6!}{3!\,3!} + \frac{6!}{4!\,2!} + \frac{6!}{5!\,1!} = 6 + 15 + 20 + 15 + 6 = 62.$$

Finally, there are $N^N = 6^6 = 46,656$ ways in which six observations
can be assigned to six compartments without restriction as to how
many are to be empty. The probability of exactly four empty
compartments is therefore $\frac{(15)(62)}{46,656}$ or .0199. By similar
reasoning, the probability of exactly five empty compartments is
$\binom{6}{5}\frac{6!}{6!}\Big/46,656 = \frac{6}{46,656} = .0001$. So the
probability of four or more empty compartments is .0199 + .0001 =
.0200. The general formula for exactly Z empty compartments
when there are N observations and N compartments is

$$\frac{\binom{N}{Z}}{N^N} \sum \frac{N!}{t_1!\, t_2! \cdots t_{N-Z}!}$$

where the summation is taken over all values of $t_1, t_2, \ldots, t_{N-Z}$
such that none of the N-Z t's is zero
and such that the sum of the t's is N.
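
A short sketch of this calculation follows. Instead of summing multinomial coefficients term by term, it counts the ways of filling the N - Z occupied compartments by inclusion-exclusion, which yields the same total (62 in the worked example); that substitution is a convenience of this sketch, not the report's method, and the function name `prob_empty` is illustrative.

```python
from math import comb

def prob_empty(n, z):
    """P(exactly z of n equal-probability compartments are empty), n observations."""
    k = n - z                                   # number of occupied compartments
    # ways to place n observations in k compartments leaving none of them empty
    onto = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return comb(n, z) * onto / n ** n

p4, p5 = prob_empty(6, 4), prob_empty(6, 5)
print(round(p4, 4), round(p5, 4), round(p4 + p5, 3))    # 0.0199 0.0001 0.02
```

Summing `prob_empty(n, z)` over z from the observed value upward gives the cumulative probability that is referred to the significance level.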


In the example given the alternative hypothesis was a
general one. Had the experimenter suspected that the true and
hypothesized populations would differ mainly in variance, if they
differed at all, David's second test would be more appropriate.
In this case, the hypothesized distribution would have been divided
into 12 equal areas and the central six might have been chosen as
test compartments if the experimenter suspected that the true
distribution had a greater-than-hypothesized variance. None of the
six sample observations fell into any of these six compartments,
so Z would have been 6. Entering David's tables for her second
test with Z = 6 and 2N = 12, a somewhat lower probability of
.0156 is found, as would be expected for a test making use of
additional "information". The second test, of course, can be
significant for either small or large values of Z or for both,
depending upon the alternative hypothesis and whether or not it is
two-tailed.
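
The exact probabilities used in the first-test example can be
checked with a short computation. The sketch below is an
illustration only: the function names are hypothetical, and it counts
the assignments that leave no compartment empty by
inclusion-exclusion rather than by the multinomial sum of the general
formula (the two counts agree, e.g. both give 62 for six observations
in two compartments).

```python
from fractions import Fraction
from math import comb


def onto_assignments(n_obs, n_cells):
    """Count ways to put n_obs distinct observations into n_cells
    labelled cells so that no cell is empty (inclusion-exclusion)."""
    return sum((-1) ** j * comb(n_cells, j) * (n_cells - j) ** n_obs
               for j in range(n_cells + 1))


def prob_exactly_empty(n, z):
    """P(exactly z of n equiprobable compartments are empty) when n
    observations are assigned independently."""
    return Fraction(comb(n, z) * onto_assignments(n, n - z), n ** n)


def prob_at_least_empty(n, z):
    """Upper-tail probability P(Z >= z); Z can be at most n - 1."""
    return sum(prob_exactly_empty(n, k) for k in range(z, n))


if __name__ == "__main__":
    print(float(prob_exactly_empty(6, 4)))   # exactly four empty
    print(float(prob_exactly_empty(6, 5)))   # exactly five empty
    print(float(prob_at_least_empty(6, 4)))  # four or more empty
```

Exact fractions are used so that the tail sum is not affected by
rounding the individual terms to four places.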

g. Discussion. It is to be noted that the hypothesized
distribution must be completely known prior to sampling. None
of its parameters should be estimated from the sample. If this
stricture is observed, then each of the N sample observations was
equally likely to have been drawn from each of the N abscissa
ranges, as required by the null hypothesis, and all N observations
could have been drawn from any specified set of ranges or
compartments. However, suppose that the distribution median is to
be estimated from the sample median. Then the sample observations
cannot possibly all have been drawn from the $\frac{N-1}{2}$
leftmost or from the $\frac{N-1}{2}$ rightmost compartments of the
distribution whose median is the same as their own. It is clear,
therefore, that the mathematical model upon which the test is based
requires that the hypothesized population distribution be completely
known in advance of sampling. To facilitate division into equal
areas, it is also desirable that the distribution be extensively
tabled.

h. Tables. Exact point probabilities for Z, as well as
probabilities cumulated to approximately the .05 level of
significance, have been tabled (16) for 3 < N < 20 for the first
test, in which Z is the number of unoccupied abscissa ranges each of
which had probability $\frac{1}{N}$ of containing any given one of
the N sample observations. For values of N greater than 20, Z is
approximately normally distributed with mean and variance given in
(16), and its probabilities may be obtained by referring the critical
ratio to normal tables; however, the calculations are laborious. For
values of N > 30, the author suggests that the hypothesized
distribution be divided into six or more equal areas, such that N
divided by the number of areas yields an expected frequency of five
or more for all compartments, and that the usual chi-square test of
fit be applied.
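
The normal approximation for the first test can be sketched as
follows. The mean and variance used here are the standard occupancy
results for N observations falling independently into N equiprobable
cells; they are supplied as an assumption, since the text defers to
(16) for the formulas, and the function names are hypothetical.

```python
from math import sqrt


def empty_cell_moments(n):
    """Mean and variance of Z, the number of empty cells, when n
    observations fall independently into n equiprobable cells."""
    p_one_empty = (1 - 1 / n) ** n   # a given cell receives nothing
    p_two_empty = (1 - 2 / n) ** n   # two given cells both receive nothing
    mean = n * p_one_empty
    # Var(Z) = E[Z(Z-1)] + E[Z] - E[Z]^2
    variance = n * (n - 1) * p_two_empty + mean - mean ** 2
    return mean, variance


def critical_ratio(z, n):
    """Standardized Z, to be referred to normal tables (no continuity
    correction is applied in this sketch)."""
    mean, variance = empty_cell_moments(n)
    return (z - mean) / sqrt(variance)


if __name__ == "__main__":
    print(empty_cell_moments(25))
    print(critical_ratio(14, 25))
```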


Exact point probabilities for Z, and probabilities cumulated
to approximately the .05 level of significance, have been tabled (16)
for 1 < N < 10 for the second test, in which the hypothesized
distribution is divided into 2N equal areas, N of which are selected
as "test compartments", and in which Z is the number of these test
compartments which are unoccupied by any of the N observations in
the sample. Again, when N exceeds 10, Z is approximately normally
distributed with mean and variance given by David, so probabilities
can be obtained by forming the critical ratio and referring it to
normal tables.


i.  Sources.  16. 


7. The Quadrant Sum (or "Corner") Test for Peripheral
Association


a. Rationale. Suppose that an X measurement and a Y
measurement have been taken on each of 2n objects and that it is
desired to test whether or not X and Y are correlated. Let the
2n points be plotted as a scattergram and let a vertical line be
drawn through the sample X-median and a horizontal line through
the sample Y-median. Now find the rightmost point in the
scattergram and, proceeding toward the middle of the scattergram,
count the number of points passed before the Y median must be
crossed to pick up the next point. This is the largest value for the
number of rightmost X-values, all of which lie on one side of the
Y-median. Call this number $R_A$ if the points are all above the
Y median and $R_B$ if they are all below it. Next find the leftmost
point and proceed analogously, calling the number of leftmost points
on one side of the Y median $L_A$ if they are all above the median
and $L_B$ if they are all below it. Now find the uppermost point and
proceed downward counting the number of points until the X median
must be crossed to obtain the next point. Let this number of points
be $A_R$ if they are all on the right side of the X-median and $A_L$
if they are all on the left. Finally, find the lowest point and
proceed upward and analogously, calling the number of points $B_R$
if they are all to the right, and $B_L$ if they are all to the left,
of the X median.

If the X and Y variables are correlated, the scattergram points
should tend to lie in one pair of the diagonal quadrants formed
by the lines through the X and Y medians. $R_A$, $A_R$, $L_B$ and
$B_L$ all refer to points in the upper right or lower left quadrants
and are therefore given a positive sign. Likewise, $L_A$, $A_L$,
$R_B$ and $B_R$ refer to points in the upper left or lower right
quadrants and therefore are given a minus sign. The four numbers
actually recorded, each preceded by the proper algebraic sign,
therefore yield an algebraic sum which can be used as the test
statistic. Consider the X values to have been ranked from 1
to 2n and the Y values likewise to have been ranked from 1 to
2n and recorded below the X ranks. There are (2n)! ways in
which the Y ranks can be permuted, and each way represents
a different set of pairings or assignments of Y values to X values.
The probability of a given quadrant sum or one more extreme is
therefore the number of these (2n)! possible sets of assignments
of Y values to X values which yield the given, or more extreme,
quadrant sum, divided by (2n)!, the number of possible assignments.


b. Null Hypothesis. Each of the (2n)! possible sets of 2n
pairs of X and Y values, which can be made with the obtained values
of X and Y, was equally likely to have been the set obtained as a
sample.


c. Assumptions. Sampling is random and independent
and the sampled population is continuously distributed, i.e., there
are no tied values.


d. Treatment of Ties. Tied observations create a practical
problem when they occur at the "crossover point", i.e., when
the manner in which the tie is broken affects the number of
extrememost points. When this occurs, the authors suggest dividing
the number of points in the tied group which are on the same side of
the median as the more extreme points, by one plus the number of
points in the tied group which are on the opposite side of the median,
and counting the result as the number of "extrememost" points in
the tied group. They regard this procedure as a conservative one.
A more conservative technique would be to resolve all ties
(including extreme observations lying on the X or Y median) in
whatever manner is least conducive to rejection of the null
hypothesis.

e. Efficiency. No information available.

f. Application. Consider the following data, in which the
points are arranged in order of increasing X-value: (15, 71),
(21, 68), (23, 75), (28, 63), (30, 57), (33, 59), (44, 65), (46, 66),
(49, 52), (55, 48). The X median lies between 30 and 33, and the Y
median between 63 and 65. The rightmost point, i.e., the point with
the largest X value, is (55, 48), and proceeding inward two points,
(55, 48) and (49, 52), are counted before a point is reached whose
Y value is on the other side of the Y median. Thus $R_B = 2$ and
$R_A = 0$. Likewise $L_A$ is found to be 3 (so $L_B$ is zero), since
the three points (15, 71), (21, 68) and (23, 75) with lowest X values
all have Y values above the Y median while the Y value paired with
the fourth smallest X value is below the Y median. The point with
largest Y value is (23, 75) and the points (15, 71) and (21, 68) have
diminishingly extreme Y values and X values on the same side of the
X median, while the fourth largest Y value, 66, is paired with an X
of 46 which is on the opposite side of the X median from the
preceding points. Therefore $A_L$ is three and $A_R$ is zero. The
points in order of increasing Y value are (55, 48), (49, 52),
(30, 57) ... of which the first two have X values to the right of the
X median, the third having an X value to the left of the median.
Therefore $B_R = 2$ and $B_L = 0$. Giving these values the algebraic
signs corresponding to the associated quadrant, the quadrant sum is

$$+(R_A + A_R + L_B + B_L) - (L_A + A_L + R_B + B_R) = +(0 + 0 + 0 + 0) - (3 + 3 + 2 + 2) = -10.$$

Entering Olmstead and Tukey's tables (48), with 2n = 10 we find the
probability of a quadrant sum equal to or more extreme than 10 in
absolute value to be .0642. A null hypothesis of no association
could not be rejected at the two-tailed .05 level. However, if the
null hypothesis were that there is either no association or a
positive correlation, it could be rejected at the one-tailed .05
level in favor of the alternative hypothesis of negative correlation.
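
The counting procedure can be sketched in code. The block below is an
illustration under stated assumptions, not Olmstead and Tukey's own
procedure in full: it assumes an even number of points with no ties
and no point lying on either median, the function names are
hypothetical, and the exact probability is obtained by brute-force
enumeration of all permutations of the Y ranks, which is practical
only for very small 2n.

```python
from itertools import permutations
from statistics import median


def quadrant_sum(points):
    """Signed quadrant sum for untied (x, y) points, none on a median."""
    x_med = median(x for x, _ in points)
    y_med = median(y for _, y in points)

    def above_y(p):
        return p[1] > y_med

    def right_of_x(p):
        return p[0] > x_med

    def run(ordered, side):
        # Count leading points lying on the same side as the first one.
        first = side(ordered[0])
        count = 0
        for p in ordered:
            if side(p) != first:
                break
            count += 1
        return count, first

    by_x = sorted(points, key=lambda p: p[0])
    by_y = sorted(points, key=lambda p: p[1])
    total = 0
    # (points from the extreme inward, side test, sign if test is True)
    for ordered, side, sign_if_true in (
        (by_x[::-1], above_y, +1),     # from the right: R_A (+) or R_B (-)
        (by_x, above_y, -1),           # from the left:  L_A (-) or L_B (+)
        (by_y[::-1], right_of_x, +1),  # from the top:   A_R (+) or A_L (-)
        (by_y, right_of_x, -1),        # from below:     B_R (-) or B_L (+)
    ):
        count, is_true = run(ordered, side)
        total += count * (sign_if_true if is_true else -sign_if_true)
    return total


def exact_two_tailed_p(n_points, k):
    """P(|quadrant sum| >= k) over all permutations of the Y ranks."""
    x_ranks = range(1, n_points + 1)
    hits = total = 0
    for y_ranks in permutations(x_ranks):
        total += 1
        if abs(quadrant_sum(list(zip(x_ranks, y_ranks)))) >= k:
            hits += 1
    return hits / total


if __name__ == "__main__":
    data = [(15, 71), (21, 68), (23, 75), (28, 63), (30, 57),
            (33, 59), (44, 65), (46, 66), (49, 52), (55, 48)]
    print(quadrant_sum(data))  # -10, as in the worked example
```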

g. Discussion. The quadrant sum or "corner" test for
association obviously is especially sensitive to correlation between
values at their extremes, at least at the extremes of one of the two
variables. It tends, however, to ignore correlation within the
central portion of the scattergram. Therefore, while providing an
excellent test for "peripheral" association, it is, as its authors
point out, of "unknown usefulness" "when uniform attention to
the whole scatter diagram is desired."

The test can be extended to test for association between
more than two variables, and its authors have provided a small
table of probabilities for the "octant sum", the test statistic in
the case of three variables.

The four nonzero numbers which make up the quadrant
sum are not independent since a single point can be counted twice,
i.e., an extreme-right point above the Y median may also be an
extreme-high point to the right of the X median and be counted in
both $R_A$ and $A_R$. This lack of independence could be avoided by
counting $R_A$, $R_B$, $L_A$ and $L_B$ first, then discarding these
points before counting $A_R$, $A_L$, $B_R$ and $B_L$ (or ignoring
them in the counting process). If this were done the "quadrant sum"
so obtained would have a slightly different distribution than the
quadrant sum defined by Olmstead and Tukey. However, its
probabilities could now be obtained by the methods outlined in the
chapter on exceedances. The n points above the Y median may be
regarded as the first sample, the n points below it as the second.
Let $X_l$ and $X_r$ be the most extreme leftward and rightward points
above the Y median which are not counted in $L_A$ or $R_A$. Then
exceedance formulae can be used to determine the a priori probability
that in the second sample, i.e., below the median, exactly $L_B$
points will have X values smaller than $X_l$ and exactly $R_B$ will
have X values exceeding $X_r$. The $L_A + L_B + R_A + R_B$ points can
then be discarded and an analogous procedure can be applied to the
two "samples" consisting of the h points to the left of the X median
and the $2n - L_A - L_B - R_A - R_B - h$ points to the right of it.
Since the X and Y values are independent under the null hypothesis,
the two probabilities can be combined to obtain an overall
probability for a quadrant sum whose components are exactly $R_A$,
$R_B$, $L_A$, $L_B$, $A_R$, $A_L$, $B_R$ and $B_L$, the appropriate
algebraic signs, of course, being added to obtain the quadrant sum.
Tabulation or calculation of probabilities in this manner would be
quite tedious, largely because the same quadrant sum can be obtained
in a variety of ways, depending on the values of the eight
components. The method has been outlined primarily to show the
nearness of relationship of the quadrant sum test to exceedances
theory.


h. Tables. Exact two-tailed probabilities for a quadrant
sum equal to or greater than k have been tabled (48) for the cases
2n = 2, 4, 6, 8, 10 and 14 with asymptotic probabilities for 2n =
infinity. These tables should suffice in most cases since the
probabilities at 2n = 14 are very close to those for 2n = infinity,
excepting those probabilities smaller than .01. In fact the
probability of a given quadrant sum is so insensitive to sample size
that the authors have presented a table of approximate probabilities
for the quadrant sum which does not use n as a parameter; it is
merely stipulated that the table is inapplicable if the absolute
value of the quadrant sum equals or exceeds 2(2n) - 6.

i. Sources. 48.


8.  Additional  Tests 

Mood (38) has proposed a rank test for dispersion which
has asymptotic relative efficiency of .76 relative to the F test when
both tests are two-tailed and .87 when they are both one-tailed.
If there are m X-observations and n Y-observations from continuous
distributions, the observations are ranked from 1 to m+n
irrespective of sample. The test statistic, W, is the sum of the
squared deviations of Y ranks from the average rank of all
observations, i.e.,

$$W = \sum_{i=1}^{n} \left( r_i - \frac{m+n+1}{2} \right)^2,$$

where $r_i$ is the rank of the i-th Y observation. Since W can assume
large values due to differences in either location or dispersion, it
must be assumed that the X and Y populations have identical location
parameters. The probability of W under the null hypothesis is simply
the proportion of the $\binom{m+n}{m}$ ways of obtaining m X-ranks
and n Y-ranks from m+n ranks, which give a calculated value of W
equal to or greater than that obtained. Unfortunately these
probabilities do not appear to have been tabled; however, under the
null hypothesis, W has a mean of n(m+n+1)(m+n-1)/12 and a variance of
mn(m+n+1)(m+n+2)(m+n-2)/180 and is asymptotically normally
distributed.
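
Since exact probabilities for W are said not to have been tabled, a
small computation may be useful. The following sketch (hypothetical
function names; ranks are assumed untied) computes W, its critical
ratio from the null-hypothesis mean and variance quoted above, and,
for small samples only, the exact upper-tail probability by
enumerating every possible set of n Y-ranks.

```python
from itertools import combinations
from math import sqrt


def mood_w(y_ranks, total):
    """Sum of squared deviations of the Y ranks from the mean rank."""
    center = (total + 1) / 2
    return sum((r - center) ** 2 for r in y_ranks)


def mood_critical_ratio(y_ranks, m, n):
    """Return W and (W - mean) / sqrt(variance) under the null hypothesis."""
    big_n = m + n
    w = mood_w(y_ranks, big_n)
    mean = n * (big_n + 1) * (big_n - 1) / 12
    variance = m * n * (big_n + 1) * (big_n + 2) * (big_n - 2) / 180
    return w, (w - mean) / sqrt(variance)


def mood_exact_upper_tail(y_ranks, m, n):
    """Proportion of all possible sets of n Y-ranks whose W is at least
    the observed W (brute force; small m + n only)."""
    big_n = m + n
    observed = mood_w(y_ranks, big_n)
    values = [mood_w(c, big_n)
              for c in combinations(range(1, big_n + 1), n)]
    return sum(v >= observed for v in values) / len(values)


if __name__ == "__main__":
    # Hypothetical example: Y occupies the two extreme ranks out of four.
    print(mood_critical_ratio((1, 4), m=2, n=2))
    print(mood_exact_upper_tail((1, 4), m=2, n=2))
```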


Another rank test for dispersion has been proposed by
Lehmann (33) and developed further by Sundrum (57). Let m
X-observations and n Y-observations be drawn from continuous
distributions and ranked from 1 to m+n. Then form each of the
$\binom{m}{2}$ possible pairs of X-ranks and each of the
$\binom{n}{2}$ possible pairs of Y-ranks. Finally, form each of the
$\binom{m}{2}\binom{n}{2}$ quadruples of possible pairs of X-ranks
paired with possible pairs of Y-ranks, and count the number of
quadruples, Q, in which both X-ranks are either greater or smaller
than both Y-ranks. This number can be obtained from a formula given
by Lehmann. The probability that the number of such quadruples will
be Q or greater, if the null hypothesis of identical populations is
true, is simply the proportion of the $\binom{m+n}{m}$ divisions of
m+n ranks into m and n ranks which yield that value of Q or a larger
one. The test is consistent (if the sampled populations are
continuous and if ties are randomized) but not unbiased. Sundrum
defines a statistic, L, proportional to Q and has tabled some of its
probabilities.
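
The count Q and its randomization probability can be illustrated by
brute force. This sketch does not use Lehmann's formula, which is not
reproduced in the text; it simply enumerates pairs and divisions of
the ranks, so it is suitable only for small samples. The function
names are hypothetical.

```python
from itertools import combinations


def lehmann_q(x_ranks, y_ranks):
    """Count quadruples (pair of X-ranks, pair of Y-ranks) in which both
    X-ranks exceed, or both fall below, both Y-ranks."""
    q = 0
    for x_lo, x_hi in combinations(sorted(x_ranks), 2):
        for y_lo, y_hi in combinations(sorted(y_ranks), 2):
            if x_lo > y_hi or x_hi < y_lo:
                q += 1
    return q


def lehmann_upper_tail(x_ranks, y_ranks):
    """Proportion of divisions of the pooled ranks into m X-ranks and
    n Y-ranks giving a Q at least as large as the observed Q."""
    pooled = sorted(list(x_ranks) + list(y_ranks))
    observed = lehmann_q(x_ranks, y_ranks)
    hits = total = 0
    for xs in combinations(pooled, len(x_ranks)):
        ys = [r for r in pooled if r not in xs]
        total += 1
        if lehmann_q(xs, ys) >= observed:
            hits += 1
    return hits / total


if __name__ == "__main__":
    print(lehmann_q([1, 2, 3], [4, 5, 6]))
    print(lehmann_upper_tail([1, 2, 3], [4, 5, 6]))
```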


A second "quadruple" test for dispersion suggested by
Lehmann (33, page 169) appears not to be entirely distribution-free
(38, page 521).

A test for dispersion, somewhat similar to Rosenbaum's
(see Exceedances), has been published by Kamat (28). Two
samples are drawn from populations assumed to be continuously
distributed and to have the same location parameters. The n
X-observations and m Y-observations, defined so that m > n,
are ranked from 1 to m+n, in order of magnitude, irrespective
of sample. The test statistic is then $D_{n,m} = R_n - R_m + m$
where $R_n$ and $R_m$ are the ranges of the ranks of the X and Y
observations respectively. By applying the Method of Randomization
to the $\binom{m+n}{n}$ ways of assigning n ranks to Xs and m to Ys,
exact probabilities can be obtained for $D_{n,m}$. Probabilities have
been tabled for values of m+n < 20, and a method is outlined by
which to obtain approximate probabilities when m+n > 20.


If a sample of size K is drawn from a population consisting
of the ranks from 1 to N, the sample mean, $\bar{r}$, will be
approximately normally distributed with mean $\frac{N+1}{2}$ and
variance $\frac{N^2-1}{12K}\left(1 - \frac{K}{N}\right)$ if K is
large. Therefore, Locks (34) refers the critical ratio,

$$\frac{\bar{r} - \frac{N+1}{2}}{\sigma_{\bar{r}}},$$

to normal tables to test whether or not a random sample has been
drawn from the hypothesized population. He also uses the statistic

$$\text{chi-square} = \frac{12K \sum (r - \bar{r})^2}{(K-1)(N^2-1)}$$

with K-1 degrees of freedom to test whether sample variance and
population variance are comparable.

If the hypothesized distribution is completely known and
tabled and is continuous, then goodness of fit can be tested by
methods using the probability integral transformation (15, 17, 44,
49, 50, 51). If a sample of N observations is drawn from the
hypothesized population, each sample observation's a priori
probability may be obtained from tables of cumulative probabilities
for the hypothesized distribution. Each observation may therefore
be regarded as an independent test of significance and the overall
probability for the N tests may be obtained by the usual methods
of combining probabilities of independent tests of significance.
If random sampling is assumed, the hypothesis that the observations
were drawn from a completely specified distribution may
be tested. Conversely, if it is assumed, i.e., known, that the
observations came from a specified distribution, the randomness
of sampling may be tested. In either case, no population parameters
should be estimated from the sample; they must be specified
in advance of sampling. Extensive tables exist for a test
statistic, P, based on this method.

An interesting and very simple method of linear curve
fitting has been described by Nair and his colleagues (45, 46).
One calculates the X mean and Y mean for the smallest 1/3 of
the observations and finds the point for which they are abscissa
and ordinate; he then does likewise for the largest 1/3 of the
observations and draws a straight line through these two points.
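
The line-fitting procedure just described can be sketched as follows.
Two points are assumptions of this illustration rather than
statements from the text: the "smallest" and "largest" thirds are
taken with respect to the X values, and when the number of
observations is not a multiple of three the group size is rounded
down. The function name is hypothetical.

```python
def group_average_line(points):
    """Return (slope, intercept) of the line through the centroids of
    the lowest and highest thirds of the points, ordered by X."""
    ordered = sorted(points)          # sorts by x, then by y
    third = len(ordered) // 3
    if third == 0:
        raise ValueError("need at least three points")
    low, high = ordered[:third], ordered[-third:]

    def centroid(group):
        return (sum(x for x, _ in group) / len(group),
                sum(y for _, y in group) / len(group))

    (x1, y1), (x2, y2) = centroid(low), centroid(high)
    slope = (y2 - y1) / (x2 - x1)     # fails if both centroids share an x
    return slope, y1 - slope * x1


if __name__ == "__main__":
    # Hypothetical data lying exactly on y = 2x + 1.
    pts = [(x, 2 * x + 1) for x in range(9)]
    print(group_average_line(pts))
```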

Further distribution-free tests and methods are merely
listed in the bibliography. Some are exact and provide tables of
probabilities, but lack simplicity either conceptually or in
application. Others are approximations for which there corresponds
no exact small-sample probability formula.


CHAPTER XIV


TCHEBYCHEFF  INEQUALITIES 


In 1853 Bienaymé discovered, and in 1867 Tchebycheff
rediscovered, a mathematical inequality variously called the
Bienaymé-Tchebycheff inequality, or, more frequently, simply
Tchebycheff's inequality. Following the essentials of a derivation
presented by Hoel (10), let f(x) be a continuous distribution
function with finite variance $\sigma^2$ and mean u. Then by
definition

$$\sigma^2 = \int_{-\infty}^{+\infty} (x-u)^2 f(x)\,dx.$$

This integral can be divided into three components whose sum it
equals. Thus

$$\sigma^2 = \int_{-\infty}^{u-k\sigma} (x-u)^2 f(x)\,dx + \int_{u-k\sigma}^{u+k\sigma} (x-u)^2 f(x)\,dx + \int_{u+k\sigma}^{+\infty} (x-u)^2 f(x)\,dx.$$

The second integral cannot be negative if k is positive;
therefore, if k > 0, dropping the second integral must either
diminish the value of the right-hand side of the equation or else
leave it unaffected. Thus

$$\sigma^2 \ge \int_{-\infty}^{u-k\sigma} (x-u)^2 f(x)\,dx + \int_{u+k\sigma}^{+\infty} (x-u)^2 f(x)\,dx.$$

For the first integral, of all the values of x between $-\infty$ and
$u-k\sigma$, that which will make $(x-u)^2$ smallest is that which is
closest to u, namely $u-k\sigma$. Similarly, for the second integral
that possible value of x which minimizes $(x-u)^2$ is $u+k\sigma$.
The inequality must hold therefore if these values are substituted
for x in the coefficient $(x-u)^2$. Therefore

$$\sigma^2 \ge \int_{-\infty}^{u-k\sigma} (u-k\sigma-u)^2 f(x)\,dx + \int_{u+k\sigma}^{+\infty} (u+k\sigma-u)^2 f(x)\,dx$$

$$\ge \int_{-\infty}^{u-k\sigma} k^2\sigma^2 f(x)\,dx + \int_{u+k\sigma}^{+\infty} k^2\sigma^2 f(x)\,dx$$

$$\ge k^2\sigma^2 \left[ \int_{-\infty}^{u-k\sigma} f(x)\,dx + \int_{u+k\sigma}^{+\infty} f(x)\,dx \right].$$

The first integral will be recognized as the probability that x will
be smaller than $u-k\sigma$, and the second integral, the probability
that x will exceed $u+k\sigma$. The inequality can therefore be
written

$$\sigma^2 \ge k^2\sigma^2 \left[ P_r(x < u-k\sigma) + P_r(x > u+k\sigma) \right]$$

$$\ge k^2\sigma^2 \left[ P_r(x-u < -k\sigma) + P_r(x-u > k\sigma) \right]$$

$$\ge k^2\sigma^2\, P_r(|x-u| > k\sigma),$$

or finally,

$$P_r(|x-u| > k\sigma) \le 1/k^2.$$


This is Tchebycheff's inequality, which simply states that
the probability is equal to or less than $1/k^2$ that a randomly
drawn sample observation will lie farther than k population standard
deviations from the population mean. It can be applied to an entire
sample of n observations by substituting the sample mean for x and
the true standard error of the mean for $\sigma$. The statement then
becomes: the probability does not exceed $1/k^2$ that the mean of a
random sample will lie farther than k standard errors of the mean
from the population mean.

The inequality uses both the mean and variance of the
population. If either is known, it can be substituted into the
inequality along with an hypothesized value for the other parameter
and the observed value x. The inequality can then be used to test the
hypothesis which determined the value substituted for the unknown
parameter. The significance level, $\alpha$, must equal $1/k^2$, so
$k = \frac{1}{\sqrt{\alpha}}$ and the hypothesis is rejected at the
$\alpha$ level of significance if
$\frac{|x-u|}{\sigma} > \frac{1}{\sqrt{\alpha}}$. Again, $\bar{x}$
and $\sigma_{\bar{x}}$ can be substituted for x and $\sigma$ to make
the test applicable to samples of more than one observation.
Obviously, the inequality can also be used for prediction
and for the setting of tolerance limits.
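
The test described above reduces to a comparison of one ratio with
one critical value, as the following sketch shows. The function names
and the numerical example are hypothetical; the rule itself is the
one stated in the text.

```python
from math import sqrt


def tchebycheff_bound(k):
    """Upper bound on P(|x - u| > k * sigma); never more than 1."""
    return min(1.0, 1.0 / k ** 2)


def tchebycheff_test(x, u, sigma, alpha):
    """Reject the hypothesis at level alpha when |x - u| / sigma exceeds
    1 / sqrt(alpha). Returns (reject, observed_ratio, critical_k)."""
    critical_k = 1.0 / sqrt(alpha)
    ratio = abs(x - u) / sigma
    return ratio > critical_k, ratio, critical_k


if __name__ == "__main__":
    # Hypothetical values: observation 10, hypothesized mean 0, sigma 2.
    print(tchebycheff_test(10, 0, 2, 0.05))
    print(tchebycheff_bound(2))
```

At the .05 level the critical value is about 4.47 standard
deviations, compared with 1.96 for a two-tailed test when the
population is known to be normal.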

Tchebycheff's inequality suffers from a number of deficiencies.
First, it is distribution-free only in the limited sense that it does
not completely specify the shape or contour of the distribution to
which it is to apply. It does, however, require a knowledge of the
population variance (or true standard error of the mean), which is
seldom available in the absence of knowledge of the distribution's
form. Second, it is, by nature, inexact; the only explicit
probability statement that can be made concerns the upper bound for
a probability rather than the probability itself. Finally, it is a
weak test in that, when applied to small samples, it is generally
unlikely to reject a false null hypothesis unless the hypothesis is
spectacularly in error; small discrepancies between null hypothesis
and true condition are usually detected only when extremely
large samples are taken. This last shortcoming could have been
predicted on the basis of the derivation. The central term,
$\int_{u-k\sigma}^{u+k\sigma} (x-u)^2 f(x)\,dx$, was completely
discarded and the values of x which would minimize the two remaining
integrals were substituted into them. The net result is that the
term $\sigma^2$ on the left of the inequality sign is, in all
probability, much greater than the sum of the terms constituting the
right-hand side of the inequality. The weakness of the test could
also be predicted on the basis of the fact that an otherwise unknown
distribution is poorly described by a mere knowledge of its variance
or mean or any other single parameter.

Despite these weaknesses, the inequality has as much strength
as can be obtained under the assumed conditions. That is to say,
the discrepancy between the values on opposite sides of the inequality
sign cannot be reduced without the imposition of further
restrictions (6). Many "stronger" Tchebycheff-like inequalities have
been developed at the cost of introducing more and more elaborate
assumptions about the population distribution (such as requiring that
the distribution be unimodal or symmetrical or that it increase
monotonically in progressing from its tails to its mode, etc.).
This means, of course, that the proper use of such inequalities is
restricted to populations about whose distributions more and more
is known. It seems to be in the nature of inequalities, therefore,
that strength and freedom from assumptions are inversely related.

Despite its weakness as a statistical test, Tchebycheff's
inequality has played an important part in the mathematical
development of probability theory. It has been extended to bivariate
(1, 3, 7) and multivariate (4, 7) distributions and has been
mathematically "generalized" so as to include a wide variety of
inequalities as special cases. It is involved in many important
statistical derivations. However, it is rarely used now as a
statistical test. Pearson (18) sums up what is probably still the
prevailing attitude toward Tchebycheff inequalities as statistical
tests: "On the whole we must express disappointment at the results of
Tchebycheff's process. We had found Tchebycheff's own limit based
on the second moment of small practical value, although it is to
be found occupying a prominent position in many continental works
on probability. By extending it to higher moments and
product-moments we have reached results which are great improvements
on the original Tchebycheff limit, but the method still lacks the
degree of approximation (except for probabilities over .99, say)
which would make the results of real value in practical statistics."


CHAPTER XV


EXTREME  VALUE  DISTRIBUTIONS 


The distribution of the largest, or smallest, value in a sample
of n observations has been investigated by Fisher and Tippett (4),
Gumbel (5-9, 15) and others (2, 13, 14). These investigations
have met with qualified success: the distribution of an extreme
sample value has been obtained for samples of infinite size from
certain types or classes of population. Gumbel has investigated
and tabled (9, 15) extreme values and near-extreme values for
large samples from populations whose distribution is of the
exponential type, "which covers, among others, the exponential, the
normal, and the chi-square distribution." Extreme value
distributions find important application in predicting the "return
period" for floods and other meteorological phenomena, and in
strength of materials investigations, since it is the weakest of n
"fibers", the worst of n flaws, or the heaviest of n loads which
determines when and where fracture will begin.

If a very large sample is taken, the correlation between the
largest and smallest sample values becomes negligible (5) and the
sample extremes may be regarded as effectively independent.
Under these circumstances the distribution of the sample range
can be obtained from the joint distribution of the two extremes.
The distribution of the range, obtained in this way, necessarily
incorporates all assumptions made in obtaining the distributions
for the extremes. Gumbel (6) has tabled probabilities for ranges
and range-like statistics for samples from an "unlimited
symmetrical initial distribution of the exponential type."

It  is  clear  that  the  extreme  value  statistics  discussed  above 
and  range  statistics  derived  from  them  are  completely  valid  only 
for  infinitely  large  samples.  Furthermore  they  are  distribution- 
free  only  in  the  very  limited  sense  that  the  form  of  the  underlying 
population  need  not  be  known  fully  but  rather  need  be  known  only 
to  the  degree  necessary  positively  to  categorize  it  as  belonging  to 
a  certain  specified  class  of  populations.  Such  restrictions  place 
these  statistics  outside  the  scope  of  this  report  and  no  attempt  will 
be  made  to  describe  them  in  detail. 
CHAPTER  XVI


OBTAINING  AN  OVERALL  PROBABILITY  FOR  SEVERAL 

INDEPENDENT  TESTS 


It  is  sometimes  desirable  to  obtain  an  overall  probability 
for  a  number  of  separate  and  independent  tests  of  the  same  null 
hypothesis.  It  may  be  that  conducting  an  additional  single  test 
upon  the  aggregate  data  is  justifiable  theoretically  but  not  practically
because  it  would  require  excessive  labor  or  delay.  Or  it
may  be  that  the  data  cannot  properly  be  combined.  For  example, 
one  test  may  have  been  a  t-test  for  matched  pairs,  another  a  t-test 
without  matching,  a  third,  the  sign  test,  etc. 

What is desired is the probability of acquiring by chance a
set of test outcomes as extreme as, or more extreme than, those
actually obtained. This overall probability is not the product of
the probabilities of the individual tests. To illustrate, if each of
five tests yields results at the .50 level, the product of the five
probabilities is .03125, although it is clear that in combination the
five tests are even less suggestive of a false null hypothesis than
they are individually. Probabilities can range from zero to one.
Each time a probability is added to a set of probabilities, the
product of the probabilities must diminish (or remain the same if the
added probability is 1).
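
The point can be checked by simulation. The sketch below is not part
of the original treatment: it assumes that, when the null hypothesis
is true, each test's cumulative probability is an independent draw
from the uniform distribution between 0 and 1, and the function name,
number of trials, and seed are illustrative choices only. It estimates
how often five chance probabilities have a product no larger than
.03125.

```python
import random


def fraction_with_small_product(n_tests, threshold, trials=200_000, seed=1):
    """Estimate P(product of n_tests uniform probabilities <= threshold).

    Assumption: under the null hypothesis each probability is an
    independent draw from the uniform distribution on (0, 1).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        product = 1.0
        for _ in range(n_tests):
            product *= rng.random()
        if product <= threshold:
            hits += 1
    return hits / trials


product_of_five = 0.5 ** 5  # the naive "overall probability", .03125
estimate = fraction_with_small_product(5, product_of_five)
print(product_of_five, round(estimate, 2))
```

Under these assumptions the estimate should come out near .73, so a
product as small as .03125 is the rule rather than the exception when
only chance is operating.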

The not uncommon, but fallacious, belief that the overall
probability for a set of test outcomes is expressed by the product
of the individual probabilities is apparently due to
misinterpretation of compound probability. If events A, B, C yield the
completely independent outcomes a, b, c, whose individual chance
probabilities are $p_a$, $p_b$, $p_c$, then the product $p_a p_b p_c$
gives the a priori probability that outcome a will result for event A,
outcome b will result for event B, and outcome c will result for
event C. Such a procedure is invalid for the combination of test
probabilities for two reasons. First, we are not interested in the
probability that test A will yield probability $p_a$, test B will
yield probability $p_b$, and test C will yield probability $p_c$.
Rather we are interested in the probability that the probabilities
$p_a$, $p_b$, $p_c$ will be obtained, each probability applying to
some unspecified one of the three tests. Second, test probabilities
are cumulative probabilities and therefore do not express the
probability of a single, obtained, outcome, but rather the probability
of the obtained outcome plus the probabilities of all of a defined
class of less "expected", and unobtained, outcomes. Since all of the
outcomes referred to by the smallest test probability are also, in a
sense, referred to by a portion of each of the other test
probabilities, the requirement of independence has not been met.
Multiplying test probabilities therefore not only does not give us the
probability we seek; it is not even a valid procedure for obtaining a
probability which we do not seek.

There  are  several  methods  of  obtaining  overall  probabilities. 
Each  requires  that  the  component  tests  be  independent  and  test  the 
same  null  hypothesis.  The  requirement  of  independence  means 
that  if  the  null  hypothesis  is  true  there  is  no  common  underlying 
factor  in  any  of  the  data  upon  which  the  individual  tests  are  based 
which  would  tend  to  produce  similar  test  outcomes.  Specifically 
this  means  that  unless  the  tests  are  statistically  independent 
(which  is  usually  not  the  case)  they  must  have  been  conducted  upon 
separate  and  nonoverlapping  sets  of  data  yielded  by  separate  and 
nonoverlapping  groups  of  subjects  (unless  the  null  hypothesis  is 
confined  to  the  population  of  tested  subjects).  The  reason  for  the 
requirement  that  the  individual  tests  must  test  the  same  null  hypothesis
is  obvious.

The rationale of the binomial, or Wilkinson, method is as
follows. If, before collecting data, it is decided to use the same
significance level, $\alpha$, for each of N independent tests, then
each test must have one of two outcomes: significance or
insignificance. Significance is therefore binomially distributed, with
probability $\alpha$ on a single trial. The probability that n or more
of N independent tests will yield probabilities falling within the
significance level $\alpha$ is then

$$\sum_{r=n}^{N} \binom{N}{r} \alpha^{r} (1-\alpha)^{N-r}.$$

This probability can be obtained from tables of cumulative binomial
probabilities or from tables (19) or graphs (17) designed expressly
for this purpose. A less desirable solution is afforded by the normal
approximation to the binomial. This is justified only in those rare
cases for which the binomial tables are not sufficiently extensive.
The normal tables are entered with the critical ratio

$$\frac{|n - N\alpha| - .5}{\sqrt{N\alpha(1-\alpha)}},$$

a one-tailed test being conducted, with small probabilities
corresponding to values of n greater than $N\alpha$. The approximation
cannot be expected to be good if N is small (say less than 20) or if
$N\alpha$ is less than 5.
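
As a minimal sketch of the two computations just described, the
following Python fragment evaluates the exact binomial tail and the
normal approximation with the .5 correction. The function names are
illustrative and not taken from the cited tables, and the values 4, 28
and .05 are chosen only as an example.

```python
from math import comb, erfc, sqrt


def binomial_tail(n, N, alpha):
    """Exact probability that n or more of N independent tests are significant."""
    return sum(comb(N, r) * alpha**r * (1 - alpha) ** (N - r)
               for r in range(n, N + 1))


def normal_approximation(n, N, alpha):
    """One-tailed normal approximation using the .5 continuity correction.

    Intended for n greater than N * alpha, as in the text.
    """
    critical_ratio = (abs(n - N * alpha) - 0.5) / sqrt(N * alpha * (1 - alpha))
    # Upper-tail area of the standard normal distribution.
    return 0.5 * erfc(critical_ratio / sqrt(2))


# Illustrative values: 4 or more of 28 tests significant at the .05 level.
exact = binomial_tail(4, 28, 0.05)
approx = normal_approximation(4, 28, 0.05)
print(round(exact, 3), round(approx, 3))
```

With these figures the exact value should come to about .049 and the
approximation to roughly .034. Since $N\alpha$ is only 1.4 here, the
disagreement is in keeping with the caution stated above.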

The binomial method presupposes that the size of the rejection
region, i.e., the significance level, for each test was selected prior
to collection of data, and that the same level $\alpha$ was selected
for each test. The selection of $\alpha$ prior to collection of data
insures an absence of a posteriori bias in obtaining an overall
significance level. The binomial method also requires that each of the
N tests be capable of an outcome whose cumulative probability is
exactly $\alpha$. That is to say, the test statistic need not be
continuously distributed, but, if not, it must have a discrete value
corresponding to exactly the $\alpha$ level of significance, not
simply falling within the level $\alpha$. Otherwise the binomial
method would be inaccurate in the direction of conservatism: it would
fail to announce significance as frequently as it occurred.

If the experimenter knows that nonchance values of the test
statistic can only fall on one tail of its distribution, or if he is
only interested in nonchance results falling on a specified tail, he
will use the one-tailed $\alpha$ level of significance for all N tests
and the binomial method will be highly appropriate for their
combination. However, if so far as the experimenter knows, nonchance
results can fall on either one of the two tails, and if he is
interested in both eventualities, the binomial method becomes
ambiguous in interpretation.

Ordinarily one uses a two-tailed test when one is unable to
predict the direction of nonchance results. When a single test is
conducted this is a reasonable and uncomplicated procedure. However,
even though the experimenter may be unable to specify on which
"side" of a false null hypothesis the true condition will lie, it
cannot lie on both sides, and the result would therefore be highly
ambiguous if probabilities for two-tailed tests were "combined" by
the binomial method. If he uses the two-tailed $\alpha$ for each of
his N tests, the binomial method still tells him precisely the
probability that n of his tests will yield probabilities within the
two-tailed $\alpha$ region by chance. However, the usual supposition
that if the chance probability is small the tested effects must be due
to some nonchance factor may become, in this case, a non sequitur. For
example, the chance probability that of 28 tests, 4 or more will yield
results significant at the two-tailed .05 level is .049, and an
experimenter using the .05 level for his overall significance level
would reject the null hypothesis. However, what is the "true"
hypothesis if two of the four significant tests fell in the
$\frac{1}{2}\alpha$ region and the other two fell on the opposite tail
in the $1 - \frac{1}{2}\alpha$ region? The experimenter obviously
cannot properly select a one-tailed significance region in advance of
all data collection if he does not know in what direction to expect
departures from the hypothesized condition. On the other hand, if he
selects a one-tailed region on the basis of examination of the data,
he is guilty of introducing an a posteriori bias, and the alleged
overall probability for his results will not be the true probability.
If he insists upon combining two-tailed tests, he will be able to make
a precise probability statement about a chance event, which
probability has little bearing on the nonchance event in which he is
really interested.

The  Chi-Square,  or  Fisher,  method  gives  the  probability  of 
obtaining  a  certain  product  for  the  one-tailed  cumulative  probabilities 
of  several  tests.  While  the  overall  probability  of  a  series  of  tests  is 
not  expressed  by  the  product  of  their  separate  probabilities,  that 
product  has,  itself,  a  probability  of  occurrence  which  can  be  regarded 
as  the  overall  probability  for  the  series  of  tests. 

Expanding a treatment and derivation given more concisely by
Wallis (18), let N be the number of tests whose probabilities,
$p_1, p_2, \ldots, p_N$, are to be combined. Let each test be capable
of yielding any cumulative probability between 0 and 1, each value
being equally likely, i.e., assume each test statistic to be
continuously distributed. Then the sample space for the product
$p_1 p_2 \cdots p_N = k$ is a square when N = 2, a cube when N = 3,
and an equal-sided, N-dimensional solid when N > 3. When N = 2, the
probability that the product $p_1 p_2$ does not exceed some value k is
that area in a square of unit edge which lies on the convex side of
the equilateral hyperbola $p_1 p_2 = k$ (see Figure 5). It is
therefore one minus the area enclosed by the hyperbola. The enclosed
area is $\int_k^1 (1 - p_1)\,dp_2$. Substituting $k/p_2$ for $p_1$
this becomes $\int_k^1 (1 - k/p_2)\,dp_2$ or $1 - k + k \ln k$, and
the desired probability is 1 minus this or $k(1 - \ln k)$.

[Figure 5. Area of cross-section $= 1 - \frac{k}{p_3} + \frac{k}{p_3} \log_e \frac{k}{p_3}$.]

When N = 3, the desired probability is the volume of a unit cube
minus that portion whose cross section is
$1 - p_1 p_2 + p_1 p_2 \ln p_1 p_2$ (see above) and whose
perpendicular dimension extends from $p_3 = k$ to $p_3 = 1$. The
volume to be subtracted is
$\int_k^1 (1 - p_1 p_2 + p_1 p_2 \ln p_1 p_2)\,dp_3$ or, substituting
$k/p_3$ for $p_1 p_2$,

$$\int_k^1 \left(1 - \frac{k}{p_3} + \frac{k}{p_3} \ln \frac{k}{p_3}\right) dp_3
  = 1 - k + k \ln k + \int_k^1 \frac{k}{p_3} \ln \frac{k}{p_3}\,dp_3;$$

the remaining integral, integrated by parts, becomes

$$k \left[\ln \frac{k}{p_3} \, \ln p_3\right]_k^1
  - k \int_k^1 (\ln p_3)\left(-\frac{1}{p_3}\right) dp_3
  = k \int_k^1 \frac{1}{p_3} \ln p_3\,dp_3
  = k \left[\frac{(\ln p_3)^2}{2}\right]_k^1
  = \frac{-k(\ln k)^2}{2}.$$

The subtracted volume, then, is
$1 - k + k \ln k - \frac{k(\ln k)^2}{2}$. And the desired probability
is $k\left[1 - \ln k + \frac{(\ln k)^2}{2}\right]$. The general term
for the probability of the product of N independent tests is, then,

$$k \sum_{r=0}^{N-1} \frac{(-\ln k)^r}{r!}
  \quad\text{which can be written as}\quad
  \sum_{r=0}^{N-1} \frac{e^{\ln k} (-\ln k)^r}{r!},$$

which is the sum of the first N terms of the Poisson distribution
whose mean is $-\ln k$. However, it is known that the probability
for a value of $\chi^2$ based on 2N degrees of freedom is given by the
sum of the first N terms of the Poisson whose mean is $\chi^2/2$.
Therefore the probability
$\sum_{r=0}^{N-1} \frac{e^{\ln k} (-\ln k)^r}{r!}$ of the product k is
also the probability of that value of $\chi^2$ based on 2N degrees of
freedom for which $\chi^2/2 = -\ln k$. Stated differently, when based
on 2N degrees of freedom $\chi^2 = -2 \ln k$ has the probability we
seek.

Therefore to obtain the overall probability for N tests whose
separate probabilities yield the product k, enter the $\chi^2$ tables
with 2N degrees of freedom and find the probability for the value of
$\chi^2$ equal to $-2 \ln k$. This is the probability of the product k.
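
Where tables are not at hand, the same probability can be computed
directly from the Poisson-sum form of the result. The following is a
minimal sketch, assuming every probability supplied is a one-tailed
cumulative probability greater than 0 and not exceeding 1; the
function name is illustrative.

```python
from math import exp, log


def probability_of_product(probabilities):
    """Probability that the product of N chance probabilities is <= k.

    Evaluates k * sum_{r=0}^{N-1} (-ln k)^r / r!, which is the
    probability of chi-square = -2 ln k on 2N degrees of freedom.
    Assumes each probability p satisfies 0 < p <= 1.
    """
    N = len(probabilities)
    log_k = sum(log(p) for p in probabilities)  # ln k, k being the product
    mean = -log_k                               # mean of the Poisson
    term = 1.0                                  # (-ln k)^r / r! for r = 0
    total = 0.0
    for r in range(N):
        total += term
        term *= mean / (r + 1)
    return exp(log_k) * total


# Five tests, each at the .50 level (product k = .03125).
print(round(probability_of_product([0.5] * 5), 3))
```

For five tests each at the .50 level the result is about .73, not
.03125. For a single test the function simply returns the probability
supplied, as it should.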

An alternative and equivalent method does not require the
evaluation of logarithms. As mentioned earlier, the probability for a
value of $\chi^2$ with 2N degrees of freedom is the sum

$$\sum_{r=0}^{N-1} \frac{e^{-\chi^2/2} (\chi^2/2)^r}{r!}.$$

For two degrees of freedom, N = 1 and the probability becomes simply
$e^{-\chi^2/2}$. Solving $p = e^{-\chi^2/2}$ for $\chi^2$, we have
$-\chi^2/2 = \ln p$ or $\chi^2 = -2 \ln p$. That is to say, the value
of any $\chi^2$ based on two degrees of freedom is minus twice the
natural logarithm of its own probability. Phrased differently, one can
obtain minus twice the natural logarithm of any probability by
entering the chi-square tables with that probability and with two
degrees of freedom and reading off the corresponding value of chi
square. Suppose this is done for each of the N probabilities for which
the overall probability is sought. Then for each probability $p_i$, we
obtain a $\chi_i^2$ for 2 d.f. $= -2 \ln p_i$. Because of the additive
property of $\chi^2$, these values of $\chi^2$ based on two degrees of
freedom can be summed to give a total value of $\chi^2$ based on the
sum of the separate degrees of freedom.

$$\chi_i^2 \text{ for 2 d.f.} = -2 \ln p_i$$

$$\sum_{i=1}^{N} (\chi_i^2 \text{ for 2 d.f.})
  = \sum_{i=1}^{N} -2 \ln p_i = -2 \sum_{i=1}^{N} \ln p_i$$

$$\text{total } \chi^2 \text{ based on 2N d.f.}
  = -2(\ln p_1 + \ln p_2 + \cdots + \ln p_N)
  = -2 \ln(p_1 p_2 \cdots p_N) = -2 \ln k.$$


Therefore, the total $\chi^2$ based on 2N degrees of freedom has
precisely the probability we seek, and this total $\chi^2$ has been
obtained without resort to any tables of logarithms. Extensive tables
of $\chi^2$ for two degrees of freedom (8) have been provided for use
with this method. Graphs (1, 2) exist which give the probability of
the product of two probabilities. In using the chi-square method,
Yates' correction should never be applied as it is completely
inappropriate. Also, each of the individual probabilities to be
combined must be continuously distributed, i.e., the "population"
probability must be capable of assuming any value between zero and
one. This, in turn, means that the test statistic must be continuously
distributed, which eliminates many distribution-free tests. If the
test statistic is capable of assuming a large number of different
values, however, the technique may be used as an approximate method.
Another requirement of the chi-square method is that the probabilities
to be combined must be exact cumulative probabilities, not simply
"significance levels" within which the cumulative probability has
fallen. Thus the experimenter must have available tables of exact
cumulative probabilities for each of the test statistics whose
probabilities are to be combined; tables giving the values of the test
statistic at the conventional significance levels such as .10, .05,
.01, .001 will not suffice unless linear interpolation is performed
and unless it yields very nearly exact values. A further requirement
is that the cumulative probabilities used for the individual tests
must all be one-tailed probabilities, with those probabilities near
zero all implying the same type of departure from the hypothesized
condition and with those probabilities near one all implying the
opposite type of departure. If the experimenter wishes to conduct a
two-tailed overall test at the significance level $\alpha$, he simply
rejects the null hypothesis if the product k is either so small that
its probability is less than $\frac{1}{2}\alpha$ or so large that its
probability is greater than $1 - \frac{1}{2}\alpha$. Thus the
chi-square method is free of the ambiguity surrounding the binomial
method when a two-tailed overall test is required.
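
The summation procedure and the two-tailed decision rule can be
sketched as follows. This is an illustration under stated assumptions
rather than a prescribed routine: the function names and the three
sets of probabilities are hypothetical, every probability is taken to
be an exact one-tailed cumulative probability greater than 0, and the
chi-square probability is computed from the Poisson-sum form, which
holds for even degrees of freedom.

```python
from math import exp, log


def chi_square_upper_tail_even_df(x, df):
    """Probability of a chi-square value of at least x, for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 0.0
    for r in range(df // 2):      # first df/2 Poisson terms, mean x/2
        total += term
        term *= half / (r + 1)
    return exp(-half) * total


def combine_two_tailed(one_tailed_ps, alpha):
    """Sum the 2-d.f. chi-square values, then apply the two-tailed rule."""
    components = [-2.0 * log(p) for p in one_tailed_ps]  # each on 2 d.f.
    total = sum(components)                              # on 2N d.f.
    overall = chi_square_upper_tail_even_df(total, 2 * len(one_tailed_ps))
    reject = overall < alpha / 2 or overall > 1 - alpha / 2
    return total, overall, reject


# Hypothetical one-tailed probabilities, tested overall at the .05 level.
for ps in ([0.04, 0.10, 0.03], [0.5] * 5, [0.99, 0.98, 0.995]):
    total, overall, reject = combine_two_tailed(ps, 0.05)
    print(round(total, 2), round(overall, 4), reject)
```

In this sketch the first and third sets lead to rejection, from
opposite tails, while the set of five probabilities at .50 does not.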

Wallis  (18)  has  outlined  the  method  of  obtaining  the  probability 
of  a  product  of  individual  probabilities  when  some  of  them  are  discretely
distributed.

There are other methods of obtaining overall probabilities.
A technique somewhat similar to that described as the binomial
method is attributed (3, p. 562) to Tippett. A technique (5, 6, 12,
13, 14, 15, 16) which is essentially the chi-square method was
discovered subsequently but independently by Karl Pearson. Birnbaum
(3) states that "no single method of combining independent tests of
significance is optimal in general, and hence . . . the kinds of tests
to be combined should be considered in selecting a method of
combination." Various methods are examined by him in (3), the two
methods described above, i.e., the binomial method and the chi-square
method, apparently being most effective in the generality of
applications.
SUMMARY


Two  methods  have  been  described  for  obtaining  an  overall 
probability  for  the  outcomes  of  a  set  of  statistical  tests,  using 
as  "data"  the  obtained  cumulative  probabilities  for  each  of  the
individual  outcomes.  Both  methods  require  that  the  component 
tests  be  independent  and  test  the  same  null  hypothesis. 


The binomial method gives the probability that of N tests, n
or more will yield cumulative probabilities falling within a
predesignated significance level $\alpha$. The individual test
statistics need not be continuously distributed; however, each must
have a value corresponding to a cumulative probability of exactly
$\alpha$. The binomial method is highly appropriate when the
individual tests to be combined are one-tailed, and a one-tailed
overall test of the null hypothesis is required. If $\alpha$ is taken
as a two-tailed significance level, the binomial method remains
mathematically valid, giving the chance probability of the obtained
results. However, small chance probabilities can no longer be taken as
presumptive evidence that the null hypothesis is false, since they do
not necessarily imply the existence of a more likely alternative. If
nearly equal proportions of the n significant tests fall on opposite
tails, $\frac{1}{2}\alpha$ and $1 - \frac{1}{2}\alpha$, then rejection
of the null hypothesis is unjustified since no alternative hypothesis
accounts for the results any better than does the null hypothesis.


While the overall probability of a series of tests is not
expressed by the product of their separate probabilities, that product
has, itself, a probability of occurrence which can be regarded as
the overall probability for the series of tests. The chi-square
method gives the cumulative probability for the product of the
one-tailed cumulative probabilities of N tests. It requires: (a) that
the individual test statistics be continuously distributed, i.e.,
that every cumulative probability from zero to one be equally
likely, (b) that one-tailed cumulative probabilities be used for
the individual tests, and that a cumulative probability on a given
side of .50 imply the same direction of deviation from the null
hypothesis for every test, (c) that for each test the exact cumulative
probability be used, not simply the "significance level" within which
that cumulative probability fell. The overall test of significance may
be made two-tailed at the $\alpha$ level of significance by rejecting
the null hypothesis if the one-tailed cumulative probability of the
product falls either between zero and $\frac{1}{2}\alpha$ or between
$1 - \frac{1}{2}\alpha$ and 1.


For specific cases, the following table may be helpful in
deciding which of the two methods is appropriate.


                                                            Method
Restrictive Conditions                               Binomial   Chi-Square

Continuously distributed test statistics required?   No*        Yes
Exact cumulative probabilities required?             No*        Yes
Two-tailed tests are ambiguous?                      Yes        No

* See text for qualifications.